ncbi / fcs

Foreign Contamination Screening caller scripts and documentation

seqtransform permanentFail : CFastaReader: Seq-id lcl|0-45901 is a duplicate around line 993 #42

Closed scorreard closed 1 year ago

scorreard commented 1 year ago

Describe the bug

seqtransform permanentFail

Hi team! Thanks for the tool. I have used it several times on hifiasm assemblies and it worked perfectly, so this is not an installation issue. This time, I generated an assembly using Flye with both HiFi reads and ONT reads (simplex). I ran fcs adaptor before scaffolding. I think the error is due to two contigs having the same length, even though they have different sequences. Looking forward to your feedback,

Solenne

To Reproduce

/app/fcs/bin/av_screen_x \
    -o output/ \
    --debug --euk \
    input_ont_fastq_1_assembly_consensus.cut250.tigmint.fa.k32.w100.z100.ntLink.scaffolds_cleaned3.fa

I could share the genome with you if needed, but not sure it is necessary

Software versions

Log Files

Tail of output/fcs_adaptor.log

/projects/cbp/scratch/Monterey_sea_lemon_010/V1/work/78/6f41b71d829668e24d570374e6a55c/output/debug.4dgcxzo_/tmp-outdirqnyhvbc2$ seqtransform \
    -out \
    validated.fna_0.cleaned_fa \
    -in \
    /projects/cbp/scratch/Monterey_sea_lemon_010/V1/work/78/6f41b71d829668e24d570374e6a55c/output/debug.4dgcxzo_/tmpsbhke6mp/stg38e3a4b6-ca70-4d01-964c-9ca0fad363d8/validated.fna_0.fna \
    -seqaction-xml-file \
    /projects/cbp/scratch/Monterey_sea_lemon_010/V1/work/78/6f41b71d829668e24d570374e6a55c/output/debug.4dgcxzo_/tmpsbhke6mp/stg5297ee96-87b6-4f2c-8cd3-c017eb97e817/fcs_calls.xml \
    -report \
    seqtransform.log
[job seqtransform_step] Max memory used: 24MiB
[job seqtransform_step] completed permanentFail
[step seqtransform_step] completed permanentFail
[workflow GenerateCleanedFasta] completed permanentFail
[step GenerateCleanedFasta] completed permanentFail
[workflow ] completed permanentFail
Output will be placed in: /projects/cbp/scratch/Monterey_sea_lemon_010/V1/work/78/6f41b71d829668e24d570374e6a55c/output
Executing the workflow
Traceback (most recent call last):
  File "/projects/cbp/scratch/tmp/Bazel.runfiles_2b1_r3a1/runfiles/cgr_fcs/apps/public/av_screen_x/av_screen_x.py", line 270, in <module>
    sys.exit(main())
  File "/projects/cbp/scratch/tmp/Bazel.runfiles_2b1_r3a1/runfiles/cgr_fcs/apps/public/av_screen_x/av_screen_x.py", line 258, in main
    p.launch()
  File "/projects/cbp/scratch/tmp/Bazel.runfiles_2b1_r3a1/runfiles/cgr_fcs/apps/public/av_screen_x/av_screen_x.py", line 181, in launch
    pipeline(**self.pipeline_args)
  File "/projects/cbp/scratch/tmp/Bazel.runfiles_2b1_r3a1/runfiles/pip_deps_pypi__cwltool_3_1_20211107152837/cwltool/factory.py", line 34, in __call__
    raise WorkflowStatus(out, status)
cwltool.factory.WorkflowStatus: Completed permanentFail

Tail of output/debug.4dgcxzo_/tmp-outdirqnyhvbc2/seqtransform.log

    <msg level='info'  code='No edits'  location='0-45901'>success</msg>
    <msg level='error'  code='bad input format'  location='line 994'>NCBI C++ Exception:&#xa;    T0 &quot;/netopt/ncbi_tools64/c++.by-date/20221028/GCC730-Release64MT/../src/objmgr/util/sequence.cpp&quot;, line 2941: Error: (CObjmgrUtilException::eBadLocation) ncbi::objects::CFastaOstream::x_WriteSeqIds() - Duplicate Seq-id lcl|0-45901 in FASTA output&#xa;</msg>
    <msg level='error'  code='bad input format'  location='lcl|0-45901'>CFastaReader: Seq-id lcl|0-45901 is a duplicate around line 993</msg>
</command-line-tool-report>
grep '0-45901' input_ont_fastq_1_assembly_consensus.cut250.tigmint.fa.k32.w100.z100.ntLink.scaffolds_cleaned3.fa 
>contig_3807::contig_3807:0-45901 None-None
>contig_3898::contig_3898:0-45901 None-None

Running grep '0-45901' -A 1 input_ont_fastq_1_assembly_consensus.cut250.tigmint.fa.k32.w100.z100.ntLink.scaffolds_cleaned3.fa shows that the two sequences are different.

Additional context

I think the tool treats the sequence as duplicated because both headers carry the same coordinates '0-45901', even though they belong to different contigs?
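To see whether other headers in the assembly share a coordinate range in the same way, a small check along these lines might help. It is only a sketch: the function name `headers_sharing_coordinates` is made up for illustration, and it assumes headers shaped like the two above, where the first whitespace-delimited token ends in `:start-end`. It does not reproduce how seqtransform actually derives its Seq-ids.

```python
import re
from collections import defaultdict


def headers_sharing_coordinates(fasta_lines):
    """Group FASTA headers by the 'start-end' range that ends their first token.

    Assumption: headers look like '>name::name:0-45901 extra text'.
    Returns only the ranges that appear in more than one header.
    """
    groups = defaultdict(list)
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        tokens = line[1:].split()
        if not tokens:
            continue  # empty header, nothing to inspect
        match = re.search(r":(\d+-\d+)$", tokens[0])
        if match:
            groups[match.group(1)].append(line.rstrip("\n"))
    return {coords: hdrs for coords, hdrs in groups.items() if len(hdrs) > 1}
```

Passing an open file handle, for example `headers_sharing_coordinates(open("assembly.fa"))` with a placeholder file name, would list every coordinate range that occurs in more than one header.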

etvedte commented 1 year ago

Hi Solenne,

It seems to be an issue with your FASTA seq-ids/headers. I made a testing FASTA with the exact headers you included above and got a similar error. When I deleted the trailing None-None from both sequences, it worked. When I deleted one None-None, it also worked. When I replaced None-None with two identical strings following the contigid:coordinates, I got the error.

We may need to post some guidelines about FASTA header formatting if we see more similar issues. If you want to move forward now I would just adjust the headers to make them simpler yet distinct.
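One possible way to simplify the headers is sketched below. It is an illustration rather than an official FCS recommendation: `simplify_headers` is a hypothetical helper that keeps only the first whitespace-delimited token of each header (dropping trailing text such as None-None), replaces every run of non-alphanumeric characters with an underscore, and appends a counter if two names still collide.

```python
import re


def simplify_headers(fasta_lines):
    """Yield FASTA lines with each header reduced to a short, unique name.

    Hypothetical helper: sequence lines pass through unchanged.
    """
    used = set()
    for line in fasta_lines:
        if not line.startswith(">"):
            yield line
            continue
        newline = "\n" if line.endswith("\n") else ""
        tokens = line[1:].split()
        # Keep the first token only; collapse ':', '-', etc. into '_'.
        name = re.sub(r"[^A-Za-z0-9]+", "_", tokens[0]).strip("_") if tokens else ""
        name = name or "seq"
        # Append a counter if the simplified name was already used.
        candidate, n = name, 1
        while candidate in used:
            n += 1
            candidate = f"{name}_{n}"
        used.add(candidate)
        yield f">{candidate}{newline}"


# Example usage (file names are placeholders):
# with open("assembly.fa") as src, open("assembly.renamed.fa", "w") as dst:
#     dst.writelines(simplify_headers(src))
```

With the two headers from this issue, the sketch would produce `>contig_3807_contig_3807_0_45901` and `>contig_3898_contig_3898_0_45901`. Keeping a table of old and new names is worth considering if downstream steps refer to the original headers.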

Eric

scorreard commented 1 year ago

Thanks Eric, I'll try removing the 'None-None' and will update you later this week!