oxfordmmm / gnomonicus

Python code to integrate results of tb-pipeline and provide an antibiogram, mutations and variants

"All arrays must be of the same length" #37

Closed philipwfowler closed 9 months ago

philipwfowler commented 10 months ago

More difficult to diagnose as I don't know which line in the VCF file is triggering this error.

I estimate this affects 20 samples out of 44,139.

site.07.subj.277B68C2-382F-4363-9DAB-22EAECE8BBE2.lab.277B68C2-382F-4363-9DAB-22EAECE8BBE2.iso.1.v0.12.4.per_sample.vcf.gz

$ gnomonicus --genome_object packages/tuberculosis_amr_catalogues/catalogues/NC_000962.3/NC_000962.3.gbk --catalogue_file packages/tuberculosis_amr_catalogues/catalogues/NC_000962.3/NC_000962.3_WHO-UCN-GTB-PCI-2021.7_v1.0_GARC1_RUS.csv --csvs all --json --minor_populations minor_alleles.txt --vcf_file /mnt/data/cryptic-release-two/dat/CRyPTIC2/V2/07/277B68C2-382F-4363-9DAB-22EAECE8BBE2/277B68C2-382F-4363-9DAB-22EAECE8BBE2/1/per_sample/site.07.subj.277B68C2-382F-4363-9DAB-22EAECE8BBE2.lab.277B68C2-382F-4363-9DAB-22EAECE8BBE2.iso.1.v0.12.4.per_sample.vcf
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35049/35049 [00:00<00:00, 994528.03it/s]
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/gnomonicus", line 121, in <module>
    variants = populateVariants(vcfStem, options.output_dir, diff, make_variants_csv, options.resistance_genes, catalogue=resistanceCatalogue)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/gnomonicus/gnomonicus_lib.py", line 175, in populateVariants
    variants = pd.DataFrame(vals).astype(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/pandas/core/frame.py", line 709, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 481, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 115, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/pandas/core/internals/construction.py", line 655, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length
JeremyWesthead commented 10 months ago

I've done some digging. This sample takes ~1h to get to this stage, so it's been easy to run in the background. The issue is that some variants aren't being assigned any VCF evidence. I don't have time to do another run through today, so I'm dumping the nucleotide indices here so I can pick this up on Monday:

3981284
3981285
3981286
3981437
3981438
3981500
3981536
3981622
3981676
3981739

On a side note, the fact that this takes so long is concerning and leads me to think that multithreading and/or rewriting this will be required in the near future. This took 1h and didn't even complete a list of variants, let alone mutations and effects...

JeremyWesthead commented 9 months ago

I've got a fix in place for this now, but the process took >1h and produced an output JSON of ~300MB.