ncbi / fcs

Foreign Contamination Screening caller scripts and documentation

[BUG]: run_fcsgx.py IndexError if asserted tax-div != inferred primary #35

Closed ptrebert closed 1 year ago

ptrebert commented 1 year ago

Describe the bug: run_fcsgx.py fails on highly contaminated input sequences when the tax id is set to 9606:

    #top-divs by aggregate coverage (excluding repeats)
    #       is-primary      cvg-len-nr      top-div-overlap-nr      div
            T       495467  1       prok:CFB group bacteria

    Warning: asserted div 'anml:primates' is not represented in the output!

    ---------------------------------
    Warning: Asserted tax-div 'anml:primates' is from a different tax-kingdom than the inferred primary set.
    This means that either asserted tax-div is incorrect, or the input is predominantly contamination.
    Will trust the asserted div and dismiss the inferred primary set.
    ---------------------------------

    Traceback (most recent call last):
      File "/var/tmp/pbs.5806123.hpc-batch/Bazel.runfiles_fc52787n/runfiles/cgr_fcs/apps/private/classify_taxonomy/classify_taxonomy.py", line 351, in <module>
        sys.exit(main())
      File "/var/tmp/pbs.5806123.hpc-batch/Bazel.runfiles_fc52787n/runfiles/cgr_fcs/apps/private/classify_taxonomy/classify_taxonomy.py", line 345, in main
        classify_taxonomy(parser.parse_args())
      File "/var/tmp/pbs.5806123.hpc-batch/Bazel.runfiles_fc52787n/runfiles/cgr_fcs/apps/private/classify_taxonomy/classify_taxonomy.py", line 302, in classify_taxonomy
        top_div = selected_divs[0]
    IndexError: list index out of range

To Reproduce: Supply highly contaminated sequences as input and run with tax id 9606.

Software versions: See #31

Log Files: Likely not needed.

Additional context: For testing purposes, I selected a couple of FASTA files that the genome assembler had already excluded from the main assembly (each FASTA file < 100 MB; the samples are human). Assuming the assembler was right that these sequences could not be placed in the main assembly, it is plausible that they are highly contaminated by foreign (non-human) sequences. I nevertheless started the run with tax id 9606, and my expectation is that the run finishes and reports most or all of the input sequences as highly contaminated.

Best, Peter

pstrope commented 1 year ago

Thanks for the report. We expect it to be fixed in the upcoming release. Please reopen the ticket if the issue persists.
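
For readers who hit the same traceback, the failure mode can be illustrated with a minimal sketch. The source of classify_taxonomy.py is not shown in this thread, so everything below is an assumption made for illustration: the helper names `kingdom`, `select_divs`, and `top_div_or_asserted` are hypothetical, and the fallback to the asserted div is only one plausible way to avoid the crash, not the fix that was actually released. The sketch assumes that dismissing an inferred primary set from a different tax-kingdom can leave the selected list empty, so that indexing `[0]` raises the `IndexError` seen above.

```python
from typing import List


def kingdom(div: str) -> str:
    """Return the kingdom prefix of a tax-div such as 'anml:primates'."""
    return div.split(":", 1)[0]


def select_divs(inferred_primary: List[str], asserted_div: str) -> List[str]:
    """Hypothetical selection step: keep only inferred divs that share the
    asserted div's kingdom, i.e. dismiss a cross-kingdom primary set."""
    return [d for d in inferred_primary if kingdom(d) == kingdom(asserted_div)]


def top_div_or_asserted(selected_divs: List[str], asserted_div: str) -> str:
    """Guarded replacement for `selected_divs[0]`: fall back to the asserted
    div when nothing was selected (an assumed policy, not the upstream fix)."""
    return selected_divs[0] if selected_divs else asserted_div


# Values taken from the log in this report.
inferred = ["prok:CFB group bacteria"]
asserted = "anml:primates"

selected = select_divs(inferred, asserted)  # empty: the kingdoms differ
top_div = top_div_or_asserted(selected, asserted)
```

With the values from the report, `selected` is empty, so an unguarded `selected[0]` would raise `IndexError`, while the guarded helper returns the asserted div.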