Describe the bug
run_fcsgx.py fails on highly contaminated sequence (setting tax id to 9606):
#top-divs by aggregate coverage (excluding repeats)
# is-primary cvg-len-nr top-div-overlap-nr div
T 495467 1 prok:CFB group bacteria

Warning: asserted div 'anml:primates' is not represented in the output!

---------------------------------
Warning: Asserted tax-div 'anml:primates' is from a different tax-kingdom than the inferred primary set.
This means that either asserted tax-div is incorrect, or the input is predominantly contamination.
Will trust the asserted div and dismiss the inferred primary set.
---------------------------------

Traceback (most recent call last):
  File "/var/tmp/pbs.5806123.hpc-batch/Bazel.runfiles_fc52787n/runfiles/cgr_fcs/apps/private/classify_taxonomy/classify_taxonomy.py", line 351, in <module>
    sys.exit(main())
  File "/var/tmp/pbs.5806123.hpc-batch/Bazel.runfiles_fc52787n/runfiles/cgr_fcs/apps/private/classify_taxonomy/classify_taxonomy.py", line 345, in main
    classify_taxonomy(parser.parse_args())
  File "/var/tmp/pbs.5806123.hpc-batch/Bazel.runfiles_fc52787n/runfiles/cgr_fcs/apps/private/classify_taxonomy/classify_taxonomy.py", line 302, in classify_taxonomy
    top_div = selected_divs[0]
IndexError: list index out of range
To Reproduce
Supply highly contaminated sequences as input and start the run with tax id 9606.
Software versions (please complete the following information):
See #31
Log Files
Likely not needed
Additional context
For testing purposes, I simply selected a couple of FASTA files that the genome assembler had already excluded from the main assembly (each FASTA file < 100 MB; the samples are human). Assuming the assembler was right that these sequences cannot be placed in the main assembly, it is plausible that they are highly contaminated with foreign (non-human) sequences. I nevertheless started the run with tax id 9606, and my expectation is that the run finishes and reports most or all of the input sequences as highly contaminated.
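To illustrate the behavior I would expect, here is a minimal sketch of guarding the indexing step that raises the IndexError. I have not seen the source of classify_taxonomy.py, so the helper name `pick_top_div` and its parameters are hypothetical. The fallback to the asserted div is only my reading of the warning "Will trust the asserted div and dismiss the inferred primary set", not a confirmed design.

```python
from typing import List


def pick_top_div(selected_divs: List[str], asserted_div: str) -> str:
    """Hypothetical helper, not the actual FCS-GX code.

    Returns the first selected div, or falls back to the asserted div
    when the list is empty (the case that raises IndexError in the
    traceback above).
    """
    if not selected_divs:
        # Assumption: falling back to the asserted div is the intended
        # behavior once the inferred primary set has been dismissed.
        return asserted_div
    return selected_divs[0]


# Empty list: the guarded version returns the asserted div instead of crashing.
print(pick_top_div([], "anml:primates"))
# Non-empty list: behaves like selected_divs[0].
print(pick_top_div(["prok:CFB group bacteria"], "anml:primates"))
```

With a guard along these lines the run could continue and flag the input as contaminated instead of aborting.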
Best, Peter