ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline
Other
306 stars 90 forks source link

Problem with screen_evaluate #60

Closed carlos-astudillo closed 4 years ago

carlos-astudillo commented 4 years ago

I had a problem while running PGAP and I was not able to find the source of this error.

The following is the part of the cwltool.log

[2020-02-19 13:56:21] INFO [workflow FSCR_Calls_first_pass] completed success [2020-02-19 13:56:21] INFO [step FSCR_Calls_first_pass] completed success [2020-02-19 13:56:21] INFO [workflow vecscreen] starting step screen_evaluate [2020-02-19 13:56:21] INFO [step screen_evaluate] start [2020-02-19 13:56:21] INFO [job screen_evaluate] /tmp/2cv20i8v$ screen_evaluate \ -ifmt \ seq-annot \ -tab \ /tmp/tmpo9x7foja/stg617eb1d3-8aa2-411b-bcf6-dd05dc4bb8a5/calls.tab [2020-02-19 13:56:21] WARNING [job screen_evaluate] completed permanentFail [2020-02-19 13:56:21] WARNING [step screen_evaluate] completed permanentFail [2020-02-19 13:56:21] INFO [workflow vecscreen] completed permanentFail [2020-02-19 13:56:21] WARNING [step vecscreen] completed permanentFail [2020-02-19 13:56:21] INFO [workflow ] completed permanentFail { "gbk": null, "gff": null, "input_fasta": { "class": "File", "location": "file:///pgap/output/scaffolds_5.fasta", "size": 4412918, "basename": "scaffolds_5.fasta", "checksum": "sha1$819756e73ce6be95f9e779a4ba314362f7c4d51b", "path": "/pgap/output/scaffolds_5.fasta" }, "input_submol": { "class": "File", "location": "file:///pgap/output/submol.yaml", "size": 1676, "basename": "submol.yaml", "checksum": "sha1$9bcd9d13fdfcbe339a9da3d8ea0eb88cd307200d", "path": "/pgap/output/submol.yaml" }, "nucleotide_fasta": null, "protein_fasta": null, "sqn": null }

Thank you for any help you can give me

azat-badretdin commented 4 years ago

Carlos, the failure of this step indicates to foreign contamination. There should be *.tab output that lists regions.

thibaudnis commented 4 years ago

If you wish PGAP to run regardless of the suspected contamination run pgap.py with the flag --ignore-all-errors . To see where in the genome the vector/adaptor contamination calls are you need to run pgap.py with -d (debug mode). The coordinates of the suspect regions will be in the file debug/tmp-outdir/*/calls.tab.

carlos-astudillo commented 4 years ago

Carlos, the failure of this step indicates to foreign contamination. There should be *.tab output that lists regions.

Thank you azat-badretdin. I was not able to find any calls.tab file, but as Thibaudnis said, it seems that I need to enable the debug mode. I will run with this flag enabled and check the .tab file. I will let you know the result.

Thanks again

carlos-astudillo commented 4 years ago

If you wish PGAP to run regardless of the suspected contamination run pgap.py with the flag --ignore-all-errors. To see where in the genome the vector/adaptor contamination calls are you need to run pgap.py with -d (debug mode). The coordinates of the suspect regions will be in the file debug/tmp-outdir/*/calls.tab.

Thank you Thibaudnis, I will try this. If there is any region in the calls.tab, what can I do? I will need to extract the DNA again and perform DNA sequencing again?

thibaudnis commented 4 years ago

If there is any region in the calls.tab, what can I do?

After you confirm the calls, you could drop the regions from the assembly by replacing them with gaps (stretches of Ns), or you can try to trim and filter the raw reads better and redo the assembly. That sort of depends on how many flagged regions were found.

carlos-astudillo commented 4 years ago

If you wish PGAP to run regardless of the suspected contamination run pgap.py with the flag --ignore-all-errors. To see where in the genome the vector/adaptor contamination calls are you need to run pgap.py with -d (debug mode). The coordinates of the suspect regions will be in the file debug/tmp-outdir/*/calls.tab.

Thank you Thibaudnis, I will try this. If there is any region in the calls.tab, what can I do? I will need to extract the DNA again and perform DNA sequencing again?

I performed the following step to identify the contamination regions

  1. Run PGAP with the debug flag as suggested by Thibaudnis: ./pgap.py -d -r -o results_out input.yaml
  2. Move to the output directory: cd results_out/debug/tmp-outdir
  3. Find the file calls.tab as suggested by azat-badretdin: find . -name calls.tab -type f ->./y42qtln_/calls.tab
  4. Go to the directory containing the calls.tab file cd /y42qtln_
  5. Show the content of the file: cat calls.tab -> lcl | scaffold_20 X - PHG:phiX174 contam_in_prok

Once the contaminant was detected I used the script at the link below to remove the contaminant https://github.com/htafer/remoVecSec/blob/master/README.md

python3 removeContaminant.py --genomefile scaffolds_5.fasta -c contam_in_prok.fa > scaffolds_5.corr.fasta

After that, the generated draft sequence scaffolds_5.fasta.corr passed the screen_evaluate :D. Actually, the script removed a complete scaffold of almost the same length as the phiX174 sequence length.

Thanks to azat-badretdin and Thibaudnis for the help.

PD: Just for curiosity, someone knows the virus phiX174 and its effect on the bacteria being studied? It seems that it is a virus that may be present in the DNA of bacteria.

thibaudnis commented 4 years ago

Glad this worked. We screen against phiX174 because it is a spike-in control used by Illumina. There is small chance that the phiX174 sequence is naturally occurring in the genome you sequenced, but chances are a lot higher that it is from the spike-in.

carlos-astudillo commented 4 years ago

Based on your comment I found the paper below explained what you said. "Large-scale contamination of microbial isolate genomes by Illumina PhiX control" https://environmentalmicrobiome.biomedcentral.com/articles/10.1186/1944-3277-10-18

Thank you again.