Usage of program and memory issue

pebonte commented 4 years ago

Hello,

First, thank you for your work. I would like to test your tool on other sub-families of transposable elements, so I'm trying to follow the instructions, but I have some questions.

I have a single multi-FASTA file with the mm10 genome, separate FASTA files for the individual chromosomes, and a MUSCLE alignment of my list of transposable elements in ClustalW format.

I ran the construct tool first. Then, when I ran scan on the complete genome (or on just one chromosome), the program was killed. Checking the memory, I saw that it had reached more than 200 GB of RAM. When I tested with chrF.fa and train.aln, the program worked well, so I think my FASTA files are too big for the program (1.18 GB for the complete genome, 60-200 MB for the individual chromosomes). Is there a way to run it on the whole genome?

Also, do the scripts in the pipeline directory have to be run after construct and scan, or are they a totally separate part?

The exact commands that I used are (here with the toy_test files):

    construct -o train.construct.model -u -v train.aln
    scan -o train.scan.bed -v -c chrF.fa train.construct.model

Thank you for your time and your work.

Best.

Pierre-Emmanuel

mengzhou commented 4 years ago

Hi Pierre-Emmanuel, thank you for trying out this pipeline! For your study, I recommend not using scan on the whole genome, as it was not implemented to handle very large chunks of sequence. By "large chunk" I mean regions at the Mb scale.

My suggestion is to run nhmmer on the whole genome first, then use construct to build a profile-HMM on the candidate regions identified by nhmmer, which should be at the Kb scale. Then you can use scan on those regions to find monomers.

If you have some prior knowledge of the monomer of interest, such as a consensus sequence, you can use hmmbuild from the HMMER suite with the option --singlemx to construct a profile-HMM. This profile-HMM can be used by nhmmer to quickly identify candidate monomer locations in the whole genome. Then you can use the scripts in pipeline for monomer identification in these candidate regions.
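To make the "candidate regions" step concrete, here is a minimal Python sketch of one way to turn nhmmer hits into Kb-scale regions. It is not part of this repository. It assumes nhmmer was run with --tblout (for example, nhmmer --tblout hits.tbl model.hmm genome.fa, where the file names are placeholders) and that the table uses the usual whitespace-separated nhmmer columns, with the target name in column 1, alifrom/ali to in columns 7 and 8, and the E-value in column 13. The function names, the E-value cutoff, and the pad and max_gap values are illustrative choices, not recommendations from the pipeline.

```python
from collections import defaultdict


def parse_tblout(lines, max_evalue=1e-5):
    """Yield (chrom, start, end) for each hit in nhmmer --tblout output.

    Coordinates are converted to 0-based, half-open (BED-style).
    Assumed column layout: 1 = target name, 7 = alifrom, 8 = ali to,
    13 = E-value. Check this against your HMMER version.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        fields = line.split()
        chrom = fields[0]
        ali_from, ali_to = int(fields[6]), int(fields[7])
        if float(fields[12]) > max_evalue:
            continue  # drop weak hits
        # On the minus strand alifrom > ali to, so normalise the order.
        yield chrom, min(ali_from, ali_to) - 1, max(ali_from, ali_to)


def merge_regions(hits, pad=500, max_gap=1000):
    """Pad each hit and merge hits on the same sequence that lie close together.

    pad and max_gap are hypothetical defaults. The padded end is not
    clamped to the sequence length here.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in hits:
        by_chrom[chrom].append((max(0, start - pad), end + pad))

    merged = []
    for chrom in sorted(by_chrom):
        cur_start = cur_end = None
        for start, end in sorted(by_chrom[chrom]):
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start - cur_end <= max_gap:
                cur_end = max(cur_end, end)  # extend the current region
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        if cur_start is not None:
            merged.append((chrom, cur_start, cur_end))
    return merged
```

A typical call would be merge_regions(parse_tblout(open("hits.tbl"))), with each resulting tuple written out as one tab-separated BED line. Merging nearby hits keeps tandem monomers together in one region instead of splitting them across many small ones.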

The instructions in this Readme file are written for this scenario. Once you have the candidate regions produced by nhmmer, you can follow Step 3 and 4 to generate a refined profile-HMM which can be used for sequence classification.
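Since construct and scan are meant to see only the Kb-scale candidate regions, the sequences of those regions have to be cut out of the genome FASTA first. The following is a minimal Python sketch of that step under stated assumptions: regions are given as (chrom, start, end) tuples in 0-based, half-open coordinates, the FASTA is read one record at a time so that only one chromosome is held in memory, and the chrom:start-end header naming is a hypothetical convention. The helper names are illustrative and are not part of this repository.

```python
def read_fasta(handle):
    """Yield (name, sequence) records from a FASTA file handle, one at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Keep only the first word of the header as the sequence name.
            name, chunks = (header[0] if header else ""), []
        elif name is not None and line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def extract_regions(fasta_handle, regions):
    """Yield (header, subsequence) for each (chrom, start, end) region.

    Regions on sequences absent from the FASTA are silently skipped,
    and ends past the sequence length are clamped.
    """
    wanted = {}
    for chrom, start, end in regions:
        wanted.setdefault(chrom, []).append((start, end))

    for name, seq in read_fasta(fasta_handle):
        for start, end in sorted(wanted.get(name, [])):
            end = min(end, len(seq))
            if start < end:
                yield f"{name}:{start}-{end}", seq[start:end]


def write_fasta(records, out, width=60):
    """Write (header, sequence) records to a file handle, wrapped at `width` columns."""
    for name, seq in records:
        out.write(f">{name}\n")
        for i in range(0, len(seq), width):
            out.write(seq[i:i + width] + "\n")
```

For example, write_fasta(extract_regions(open("genome.fa"), regions), open("candidates.fa", "w")) would produce one small FASTA record per candidate region; genome.fa and candidates.fa are placeholder names.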

Hope this helps and please let us know if you have any questions!

pebonte commented 4 years ago

Thank you very much for the fast reply. The process is much clearer to me now.

I'm currently running nhmmer. As soon as it's finished, I will use construct and scan on the candidate regions.

Thanks again and also thanks to Andrew.

Have a nice day.

mengzhou / MonomerAnnotation
