h836472 / ContScout

ContScout sequence contamination filter tool
GNU General Public License v3.0

Would this work for a de novo assembly? #5

Closed Anto007 closed 3 months ago

Anto007 commented 3 months ago

Hi, your new tool looks interesting. I just wanted to know if ContScout will work for a new de novo assembly that I have gff and protein fasta files for? Neither this organism nor its close relatives have been sequenced before, so corresponding genomes aren't available in public databases. Thanks

h836472 commented 3 months ago

Hi there,

Thank you for your interest in our tool. Having closely related genomes in the db clearly improves the accuracy of the prediction, but it is well worth trying the tool even if you have a brand new species with no closely related genomes. In order to run the tool, you will need to pick a taxon ID that tells the program the expected taxonomic lineage of your species. If you do not have this information, you might wish to screen your draft genome for conserved eukaryotic proteins with BUSCO and survey the hits for their closest relatives in the nr / UniProtKB db. Please let me know if you need any assistance with the run or with the interpretation of the results.

Yours Balazs

Anto007 commented 3 months ago

Thank you @h836472 for your prompt response; I hope to give it a shot and I can specify the taxon id. I have an insect genome assembly but since the specimen also feeds on other insects, I'm worried that the assembly might be contaminated with genomic fragments from its prey insects. I have also tried to remove all the prokaryotes & fungi from the reads (since these organisms are symbionts of the host insect) before the genome assembly step but if I understood it correctly, ContScout is also capable of identifying & removing prokaryote or fungal contaminants, correct? Do you know if the conda route will work for the native installation of this tool? If so, would you have a recipe .yml file? The conda version of MMseqs2 has given me problems previously and so I'm not too sure about proceeding that way. The docker route is my least preferred one and I would like to avoid it as much as possible. Besides, the documentation of the docker commands seems to be a bit lacking in your User Manual. Thank you very much again

h836472 commented 3 months ago

Hi there,

Containerization (docker / singularity) is meant to be the convenient way for the user community to use ContScout, since we rely on many external dependencies and it might be tricky to install them all properly. Containers, in contrast, are ready-made for you and should work out of the box. Please give docker / singularity a try before trying to build your own conda or native run environment. The important part with containers is that you properly share the working and temp directories between your native system and the containerized ContScout. Let me know if you need command examples and I can send a few. I do not have any conda-specific installer script for ContScout at the moment.
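As a rough illustration of the directory sharing described above, the Python sketch below assembles a `docker run` command line with the working and temp directories bind-mounted into the container. The image name `contscout-image`, the in-container paths `/work` and `/tmp_cs`, and the helper name `build_docker_command` are placeholders chosen for this example, not values taken from the ContScout documentation. ContScout's own arguments are deliberately left out.

```python
import shlex
from pathlib import Path


def build_docker_command(work_dir, tmp_dir, image="contscout-image"):
    """Return an argv list for `docker run` with two bind mounts.

    `image`, `/work` and `/tmp_cs` are hypothetical placeholders;
    substitute the real image name and container paths.
    """
    work = Path(work_dir).resolve()  # docker needs absolute host paths
    tmp = Path(tmp_dir).resolve()
    return [
        "docker", "run", "--rm",      # --rm removes the container on exit
        "-v", f"{work}:/work",        # host working dir -> container
        "-v", f"{tmp}:/tmp_cs",       # host temp dir -> container
        image,
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected before it is run.
    print(shlex.join(build_docker_command("./project", "./scratch")))
```

Building the command as a list keeps paths with spaces intact if it is later passed to `subprocess.run`. Singularity uses a different binding option, so this sketch covers docker only.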

Distinguishing prokaryotes from insects should work well with the tool. If the prey genome (or a close enough relative) is available, insect versus insect cleanout could work too, but that could turn out to be a bit of a challenge. Also, the quality of the assembly can influence the cleaning performance. Fragmented assemblies with many small contigs might be more troublesome, while good quality genomes with a handful of large contigs should be easier to resolve. Hope you will find the tool useful for your work,

Balazs
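To get a feel for how fragmented an assembly is before running a cleanup, one common summary is the contig count together with N50. The sketch below computes both from a list of contig lengths. N50 is a general assembly metric and is not mentioned in this thread, and the function name `assembly_stats` is made up for this example. No threshold for "too fragmented" is implied.

```python
def assembly_stats(contig_lengths):
    """Return (number of contigs, total length, N50) for a list of lengths.

    N50 is the length of the contig at which the running sum of lengths,
    taken from the longest contig downwards, first reaches half of the
    total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return len(lengths), total, n50


if __name__ == "__main__":
    count, total, n50 = assembly_stats([100, 200, 300, 400])
    print(f"contigs={count} total={total} N50={n50}")
```

Many short contigs and a low N50 relative to the expected genome size would point to the harder, fragmented case described above.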

Anto007 commented 3 months ago

Thanks again for your response. Unfortunately, the prey genomes are also not available and I've got a very fragmented assembly from the host insect of interest. Discriminating between contigs from the various insect species is my current priority, so I will need to think more about this. Closing this thread for now; I will open a new ticket in case I run into any issues while installing or running your tool.