ncbi / fcs

Foreign Contamination Screening caller scripts and documentation

action_report fails with IndexError #21

Closed reslp closed 1 year ago

reslp commented 1 year ago

Hi,

I have been analyzing a different genome (also from a flatworm) from the one I mentioned in #20. This time, the pipeline failed at the action_report step. I can't find any indication in the *.rpt file of why it fails. The only thing I noticed is that the *.rpt file is quite large: 30 MB with ~175,000 lines, indicating many potential contaminants.

Here is the output of the pipeline:

--------------------------------------------------------------------

tax-id    : 142782
fasta     : results/Chimaericola_leptogaster/ASSEMBLY_CLEANUP/FCS-ADAPTOR/Chimaericola_leptogaster_cleaned.fcs.adaptor.fas
size      : 637.32 MiB
split-fa  : True
BLAST-div : flatworms
gx-div    : anml:worms
w/same-tax: True
bin-dir   : /app/bin
gx-db     : /cl_tmp/reslph/projects/annocomba-monos/data/ncbi-fcs-db/db/all
output    : results/Chimaericola_leptogaster/ASSEMBLY_CLEANUP/FCS-FOREIGNSEQS/Chimaericola_leptogaster_cleaned.fcs.adaptor.142782.taxonomy.rpt

--------------------------------------------------------------------

Prefetched memory-mapped pages in 20255.4s; 0.0157504 GB/s.
Collecting masking statistics...
Collected masking stats:  0.657063 Gbp; 41.7207s; 15.7491 Mbp/s. Baseline: 1.57106

Prefetched memory-mapped pages in 10574.1s; 0.0167474 GB/s.
Processed 175102 queries, 653.899Mbp in 4061.66s. (0.160993Mbp/s)
Source file results/Chimaericola_leptogaster/ASSEMBLY_CLEANUP/FCS-FOREIGNSEQS/Chimaericola_leptogaster_cleaned.fcs.adaptor.142782.taxonomy.rpt.tmp

 * * * Asserted tax-div, anml:worms, conflicts with highest-coverage one, anml:fishes. * * * 

primary-divs: ['anml:worms'] (6%)
Top represented divs:
    anml:fishes         2701208 bp
    anml:molluscs       1784689 bp
    anml:crustaceans    1536120 bp
    anml:insects         766202 bp
    anml:worms           660652 bp

 * * * Warning: Fraction of primary-div = 6%, below 66% * * * 

Aggregate coverage: 10%
Low-coverage mode. The following contaminant divs are out of scope
anml:rotifers : len:11365; div-cvg:34%

Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_e1kfuuj4/runfiles/cgr_fcs/apps/private/action_report/action_report.py", line 1219, in <module>
    sys.exit(main())
  File "/tmp/Bazel.runfiles_e1kfuuj4/runfiles/cgr_fcs/apps/private/action_report/action_report.py", line 1214, in main
    return action_report(args)
  File "/tmp/Bazel.runfiles_e1kfuuj4/runfiles/cgr_fcs/apps/private/action_report/action_report.py", line 1116, in action_report
    seq.calc_actions(primary_div_name, low_cov_mode, thresholds)
  File "/tmp/Bazel.runfiles_e1kfuuj4/runfiles/cgr_fcs/apps/private/action_report/action_report.py", line 743, in calc_actions
    s = spans[first_contam_index]
IndexError: list index out of range
Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_hgo_mwz5/runfiles/cgr_fcs/apps/private/run_gx/run_gx.py", line 786, in <module>
    main()
  File "/tmp/Bazel.runfiles_hgo_mwz5/runfiles/cgr_fcs/apps/private/run_gx/run_gx.py", line 767, in main
    run_classify_taxonomy_and_action_report(args)
  File "/tmp/Bazel.runfiles_hgo_mwz5/runfiles/cgr_fcs/apps/private/run_gx/run_gx.py", line 496, in run_classify_taxonomy_and_action_report
    run(
  File "/tmp/Bazel.runfiles_hgo_mwz5/runfiles/cgr_fcs/apps/private/run_gx/run_gx.py", line 484, in run
    subprocess.run(cmd, stdout=out_file, check=True, stderr=sys.stderr)
  File "/usr/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/app/bin/action_report', '--in=results/Chimaericola_leptogaster/ASSEMBLY_CLEANUP/FCS-FOREIGNSEQS/Chimaericola_leptogaster_cleaned.fcs.adaptor.142782.taxonomy.rpt']' returned non-zero exit status 1.

I would be grateful for any suggestions on how to resolve this.

Kind regards,

Philipp

reslp commented 1 year ago

A quick addition. I reran the command with --debug. This is where it fails:

----
1:  inconclusive:   2   lcl|13797536:   1 .. 12106    :  0 after grey merge
span_records len:1
last span len:0
pending_greys len:0
curr result: inconclusive  start_pos:1
----
span_records len:1
last span len:0
pending_greys len:1
curr result: inconclusive  start_pos:8170
----
span_records len:1
last span len:0
pending_greys len:2
curr result: low-coverage  start_pos:10649
----
1:  inconclusive:   3   lcl|13797537:   1 .. 20674    :  802 after grey merge
Seqid lcl|13689133 back trim from 1362 to 2596
Seqid lcl|13755117 back trim from 10008 to 13495
Traceback (most recent call last):
  File "/tmp/Bazel.runfiles_yhawpeok/runfiles/cgr_fcs/apps/private/action_report/action_report.py", line 1219, in <module>
    sys.exit(main())
  File "/tmp/Bazel.runfiles_yhawpeok/runfiles/cgr_fcs/apps/private/action_report/action_report.py", line 1214, in main
    return action_report(args)
  File "/tmp/Bazel.runfiles_yhawpeok/runfiles/cgr_fcs/apps/private/action_report/action_report.py", line 1116, in action_report
    seq.calc_actions(primary_div_name, low_cov_mode, thresholds)
  File "/tmp/Bazel.runfiles_yhawpeok/runfiles/cgr_fcs/apps/private/action_report/action_report.py", line 743, in calc_actions
    s = spans[first_contam_index]
IndexError: list index out of range
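
The source of action_report.py is not shown here, so the following is only a hypothetical sketch of the kind of situation that can raise this error: looking up the first contaminant span by index in a sequence that has none. The function name `first_contaminant` and the idea that the lookup is keyed on the per-span result label are assumptions for illustration, not the FCS-GX implementation. The labels are the ones that appear in the report rows for this assembly.

```python
from typing import Optional, Sequence, Tuple


def first_contaminant(results: Sequence[str]) -> Optional[Tuple[int, str]]:
    """Return (index, label) of the first 'contaminant' span, or None.

    Returning None instead of an index avoids reading past the end of
    the list when a sequence has no contaminant span at all.
    (Hypothetical helper, not part of FCS-GX.)
    """
    for index, label in enumerate(results):
        if label == "contaminant":
            return index, label
    return None


# Per-span result labels as they appear in the report rows of this issue.
mixed = ["primary-div", "contaminant"]                        # first rows of lcl|13689133
no_contam = ["inconclusive", "inconclusive", "low-coverage"]  # lcl|13797537

# Unguarded pattern: an index that points one past the last span.
try:
    no_contam[len(no_contam)]
except IndexError as err:
    print("unguarded lookup failed:", err)

print(first_contaminant(mixed))
print(first_contaminant(no_contam))
```

Whether the real failure comes from a sequence without a contaminant span, from a span row without a result call, or from something else cannot be determined from the traceback alone.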

The *.rpt entries for the sequence IDs that appear at the very end of the debug output look like this:

cat Chimaericola_leptogaster_cleaned.fcs.adaptor.142782.taxonomy.rpt | grep "lcl|13689133"
lcl|13689133~~1..1361   1361    703,0,0,0   1337    |   Procambarus clarkii 6728    anml:crustaceans    1326    1326    46  |   7739    anml:fishes 278 278 20  |   6526    anml:molluscs   242 242 18  |   32201   plnt:plants 190 190 17  |   3   primary-div anml:crustaceans    97
lcl|13689133~~1362..2032    671 201,0,0,0   633 |   Biomphalaria glabrata   6526    anml:molluscs   497 497 26  |   7739    anml:fishes 456 456 25  |   1735272 anml:molluscs   497 43  8   |                       |   3   contaminant anml:molluscs   74
lcl|13689133~~2033..2596    564 563,0,0,0   169 |   Biomphalaria glabrata   6526    anml:molluscs   169 169 15  |   7739    anml:fishes 135 135 14
cat Chimaericola_leptogaster_cleaned.fcs.adaptor.142782.taxonomy.rpt | grep "lcl|13755117"
lcl|13755117~~1..10007  10007   6436,0,0,0  2373    |   Salmo salar 8030    anml:fishes 1993    1915    64  |   6454    anml:molluscs   862 827 37  |   45351   anml:basal metazoans    622 608 32  |   29144   anml:fishes 1993    480 26  |   4   primary-div anml:fishes 20
lcl|13755117~~10008..12355  2348    0,0,80,0    2229    |   Haliotis rufescens  6454    anml:molluscs   1404    1404    51  |   1187980 anml:nematodes  1229    1158    44  |   294128  anml:crustaceans    762 731 37  |   51707   plnt:green algae    308 308 24  |   3   contaminant anml:molluscs   60
lcl|13755117~~12356..13495  1140    729,0,0,0   13  |   Chanos chanos   29144   anml:fishes 13  13  5   |                       ||
cat Chimaericola_leptogaster_cleaned.fcs.adaptor.142782.taxonomy.rpt | grep "lcl|13797537"
lcl|13797537~1..8057    8057    2461,328,0,0    560 |   Biomphalaria glabrata   6526    anml:molluscs   290 202 17  |   36100   anml:molluscs   290 168 16  |   2065413 anml:insects    170 81  11  |   931172  anml:insects    170 53  8   |   0   inconclusive    none    0
lcl|13797537~8170..10568    2399    557,0,0,0   241 |   Python bivittatus   176946  anml:reptiles   124 124 13  |   2607531 anml:molluscs   45  45  9   |   482537  anml:mammals    38  38  8   |   7113    anml:insects    26  26  6   |   0   inconclusive    none    0
lcl|13797537~10649..20674   10026   1101,0,100,0    2817    |   Pollicipes pollicipes   41117   anml:crustaceans    774 763 43  |   2743191 anml:echinoderms    857 799 31  |   8030    anml:fishes 671 540 30  |   120017  fung:basidiomycetes 235 164 14  |   2   low-coverage    anml:crustaceans    8
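
As an alternative to `cat ... | grep`, the rows for one sequence can be pulled out with an exact ID match, which also makes it easy to see where the spans of a sequence are not contiguous. This is a minimal sketch that assumes only what is visible in the rows above: the first field has the form `seq-id~start..end` (with one or two `~`). The names `SpanRow`, `rows_for` and `gaps` are made up for this example.

```python
import re
from typing import Iterable, List, NamedTuple, Tuple


class SpanRow(NamedTuple):
    seq_id: str
    start: int
    end: int
    line: str


# First field, e.g. "lcl|13689133~~1362..2032" or "lcl|13797537~1..8057".
_FIRST_FIELD = re.compile(r"^(?P<seq>[^~\s]+)~+(?P<start>\d+)\.\.(?P<end>\d+)")


def rows_for(lines: Iterable[str], seq_id: str) -> List[SpanRow]:
    """Collect the report rows whose sequence ID equals seq_id exactly."""
    rows = []
    for line in lines:
        m = _FIRST_FIELD.match(line)
        if m and m.group("seq") == seq_id:
            rows.append(
                SpanRow(seq_id, int(m.group("start")), int(m.group("end")),
                        line.rstrip("\n"))
            )
    return rows


def gaps(rows: List[SpanRow]) -> List[Tuple[int, int]]:
    """Return (previous end, next start) wherever consecutive spans do not abut."""
    ordered = sorted(rows, key=lambda r: r.start)
    return [(a.end, b.start) for a, b in zip(ordered, ordered[1:])
            if b.start != a.end + 1]


# Shortened copies of rows from this issue (only the leading fields kept).
sample = [
    "lcl|13689133~~1..1361\t1361",
    "lcl|13797537~1..8057\t8057",
    "lcl|13797537~8170..10568\t2399",
    "lcl|13797537~10649..20674\t10026",
]
for row in rows_for(sample, "lcl|13797537"):
    print(row.start, row.end)
print(gaps(rows_for(sample, "lcl|13797537")))
```

To run it on the real report, pass an open file handle for the *.rpt file instead of `sample`.
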
pstrope commented 1 year ago

Hi Philipp,

We recently updated FCS-GX to v0.3.0. Could you please re-run using the newest image and run script, and see if this problem goes away?

Thank you! Pooja

pstrope commented 1 year ago

Closing. Please follow up if you have other questions.