ncbi / fcs

Foreign Contamination Screening caller scripts and documentation
Other
88 stars 12 forks source link

[BUG]: permanentFail during job blast #44

Open egonozer opened 1 year ago

egonozer commented 1 year ago

Describe the bug I am running fcs-adaptor.sh v0.4.0 using Singularity on several genome assemblies generated using SPAdes v3.15.5. Nearly all of them run without error, but I consistently get a permanentFail error for some right after the [job blast] $ vecscreen \ ... command is run. I was getting the same error with fcs-adaptor v0.2.3, but upgrading to v0.4.0 didn't help. When I modified the fcs-adaptor.sh script to include the --debug statement, the last directory in the debug folder named contains the vecscreen.log file which has the following error message:

91501/000/0000/P  CF98656D495AB571 0003/0003 2023-06-23T09:25:28.132666 quser31         UNK_CLIENT      UNK_SESSION              vecscreen Error: VECSCREEN "vecscreen_app.cpp", line 149: CVecScreenApp::Run() --- Blast Error: Near line 0, the local id is too long.  Its length is 51 but the maximum allowed local id length is 50.  Please find and correct all local ids that are too long.

Looking at the file in the debug folder that was used as input for vecscreen, split_fasta.fna, here are the first few sequence headers:

>lcl|NODE_1_length_1090171_cov_39.792984__1_500000
>lcl|NODE_1_length_1090171_cov_39.792984__499501_999500
>lcl|NODE_1_length_1090171_cov_39.792984__999001_1090171
>lcl|NODE_2_length_887790_cov_41.364514__1_500000
>lcl|NODE_2_length_887790_cov_41.364514__499501_887790
>lcl|NODE_3_length_557874_cov_41.529717__1_500000
>lcl|NODE_3_length_557874_cov_41.529717__499501_557874
>NODE_4_length_463023_cov_41.874365
>NODE_5_length_417001_cov_40.711275
>NODE_6_length_369796_cov_39.096108

So it looks like the output of the fasta_split application is generating some sequence IDs that are too long for vecscreen. Is this something that can be modified in fcs-adapator? I really don't want to have to take the step of renaming all my input files with shorter sequence identifiers before running them through fcs_adaptor.

Thanks!

To Reproduce ./run_fcsadaptor.sh --fasta-input ../spades/1148/contigs.fasta --output-dir test --prok --container-engine singularity --image fcsadaptor_0.4.0/fcs-adaptor.sif

Software versions (please complete the following information):

Log Files Attached debug.y6aphyfv.tar.gz

etvedte commented 1 year ago

Hello,

Apologies for the delayed response. We currently have a work ticket documenting this issue and will update this GitHub issue when we have a fix in place. For the timeliest processing of all your assemblies, I would recommend shortening the seqids.

ptrebert commented 3 months ago

@etvedte I believe I am also running into this - can you please update on what the status here is?

ptrebert commented 3 months ago

Following @egonozer example, I also added the debug switch and can confirm that it's the same problem.

I already raised some logging-related remarks in another issue ( #63 ) and I think this also applies here: the log message actually explaining what's wrong should by all means show up in an easily accessible and obvious log file / in captured stderr. It seems that in this case all the relevant log output is send to temporary locations that are being deleted after the job failed.

etvedte commented 3 months ago

Hello @ptrebert,

I looked in to this some more. I was able to reproduce this specific issue. Apparently it only occurs when the vecscreen BLAST search identifies a hit in a chunked region when the original seq_id plus the coordinates from fasta_split exceed 50 characters. In other words, if your sequence is clean, then so long as the original seq_ids are <=50 you should be good. You could try to confirm this observation by shortening the seq_ids and then running FCS-adaptor. If you don't see any contamination calls in that situation, I would be surprised.

As far as remedying this issue, it is a work ticket with lower priority since there is a workaround (i.e. modifying the FASTA headers). Are you using SPAdes, or something else? As far as I'm aware most assemblers should be producing more modest-sized headers. Submissions to NCBI also require seq_ids <= 50 characters. While we do advertise a benefit of these tools is that you can screen without being tied to an NCBI submission, ultimately some of the requirements for submission screening were lifted for these tools.

As far as the logging issue, we can look into how we can better report this. There is the FASTA validator step that prints errors to screen when the original non-split seq_id headers are >50, but as mentioned above this is a different issue.