ncbi / fcs

Foreign Contamination Screening caller scripts and documentation

[FEATURE REQUEST]: Decontaminated contig fasta #27

Closed andigoni closed 1 year ago

andigoni commented 1 year ago

Dear FCS Team,

Thank you for these very much-needed tools. FCS-GX works perfectly well, but is there any function to build the de-contaminated fasta file after filtering out the problematic regions?

Thanks, Antigoni

etvedte commented 1 year ago

Hi Antigoni,

Currently we do not have a GX function to clean up identified contaminants, although we do have plans to add that functionality in the future. Do you have access to seqkit or another FASTA parsing tool? As a temporary solution we can certainly add commands to our documentation to help users with this.

Eric

invertome commented 1 year ago

Dear Eric,

I second Antigoni's opinion. I think that the ability to output a decontaminated file and/or example commands to do this with parsing tools would be much appreciated by the community.

Thank you for putting together this great tool!

Cheers, Jorge

etvedte commented 1 year ago

Hi Jorge,

Thank you for the feedback; it's good to know that multiple members of the community are asking for this. If you have seqkit installed, you can run the following commands to collect the contaminant sequence accessions from the GX action report and then use seqkit grep to retrieve all the non-contaminant sequences:

grep -w -E 'EXCLUDE|FIX|TRIM' fcs_gx_report.txt | awk '{print $1}' > contam.acc.tmp
seqkit grep -v -f contam.acc.tmp genomic.fna > genomic.decontam.fna

Note that this will remove the ENTIRE sequence for any record with a FIX or TRIM call, not just the contaminated range. Unfortunately, seqkit grep does not appear to support region subsequences in the ID list, so it cannot handle the FIX|TRIM ranges directly. seqkit subseq (another subsetting command) does accept regions, e.g. in BED format, but unlike seqkit grep it cannot invert the match. So my best suggestion at the moment is to run the commands above, then make a BED file of the non-contaminant regions of the FIX|TRIM sequences and add those regions back using seqkit subseq. Hopefully you don't have many FIX|TRIM calls in your report.
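If you do have more than a handful of FIX|TRIM calls, the BED file of regions to keep could be generated rather than written by hand. The following is only a rough sketch, not an official FCS command. The function name keep_regions_bed is made up for illustration, and it assumes the action report is tab-separated with '#' header lines, with seq_id, start_pos, end_pos, seq_len and action in columns 1 to 5 and 1-based inclusive coordinates. Please check this against your own report before relying on it.

```shell
# Hypothetical helper (not part of FCS or seqkit): print a BED file
# (0-based, half-open) of the regions to KEEP on every sequence that has
# a FIX or TRIM call in the action report.
# ASSUMED report layout: tab-separated, '#' header lines,
#   $1 seq_id, $2 start_pos, $3 end_pos, $4 seq_len, $5 action,
#   with 1-based inclusive coordinates.
keep_regions_bed() {
    # 1) keep only FIX/TRIM rows, 2) sort by sequence then start position,
    # 3) emit the gaps between contaminant ranges as BED intervals.
    awk -F'\t' '!/^#/ && ($5 == "FIX" || $5 == "TRIM")' "$1" |
    sort -t "$(printf '\t')" -k1,1 -k2,2n |
    awk -F'\t' -v OFS='\t' '
        # Emit the clean tail of the previous sequence, if any is left.
        function emit_tail() {
            if (id != "" && pos < len) print id, pos, len
        }
        # New sequence: finish the previous one and reset the cursor.
        $1 != id { emit_tail(); id = $1; len = $4 + 0; pos = 0 }
        {
            # Clean stretch before this contaminant range.
            if ($2 - 1 > pos) print id, pos, $2 - 1
            # Move the cursor past the contaminant range.
            if ($3 + 0 > pos) pos = $3 + 0
        }
        END { emit_tail() }'
}

# Usage sketch (assumes seqkit is installed; file names as in the commands above):
#   keep_regions_bed fcs_gx_report.txt > keep.bed
#   seqkit subseq --bed keep.bed genomic.fna >> genomic.decontam.fna
```

The retained pieces come out as separate records whose names are chosen by seqkit subseq, so a FIX call in the middle of a sequence ends up splitting that sequence in two. That is a choice made for this sketch and may not match how a future GX cleaning function treats FIX calls.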

We are currently working on a genome cleaning function as a part of an upcoming GX release.

murphyte commented 1 year ago

FYI, contamination cleanup is now supported in fcs v0.4.0 using fcs.py clean genome. See (E) under https://github.com/ncbi/fcs/wiki/FCS-GX

If you run into any problems with the option, please open a new issue. Thanks!