Open pebonte opened 4 years ago
Hi Pierre-Emmanuel, thank you for trying out this pipeline! For your study, I recommend not running scan on the whole genome, as it was not implemented to handle very large chunks of sequence. By "large chunks" I mean regions at the Mb scale.
My suggestion is to run nhmmer on the whole genome first, then use construct to build a profile-HMM from the candidate regions identified by nhmmer, which should be at the Kb scale. Then you can run scan on those regions to find monomers.
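As a rough illustration of turning nhmmer hits into Kb-scale candidate regions, here is a minimal Python sketch. It assumes nhmmer was run with --tblout and that the table uses the standard whitespace-separated columns (target name first, alifrom and ali to in columns 7 and 8, E-value in column 13). The function names, the E-value cutoff, the merge gap and the flank size are all hypothetical choices for illustration, not part of this pipeline.

```python
from collections import defaultdict


def read_tblout_hits(lines, max_evalue=1e-5):
    """Yield (target, start, end) per hit, 1-based inclusive, start <= end.

    Assumes the nhmmer --tblout column layout; max_evalue is an arbitrary cutoff.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        target = fields[0]
        ali_from, ali_to = int(fields[6]), int(fields[7])
        evalue = float(fields[12])
        if evalue > max_evalue:
            continue
        # On the minus strand alifrom > ali to, so normalise the order.
        yield target, min(ali_from, ali_to), max(ali_from, ali_to)


def merge_hits(hits, max_gap=500, flank=100):
    """Merge hits closer than max_gap bp and pad each region by flank bp.

    The right-hand flank is not clamped to the sequence length here.
    """
    by_target = defaultdict(list)
    for target, start, end in hits:
        by_target[target].append((start, end))

    regions = []
    for target in sorted(by_target):
        merged = []
        for start, end in sorted(by_target[target]):
            if merged and start - merged[-1][1] <= max_gap:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        for start, end in merged:
            regions.append((target, max(1, start - flank), end + flank))
    return regions
```

With a real table, something like `merge_hits(read_tblout_hits(open("hits.tbl")))` would give a list of (sequence, start, end) tuples; the file name is a placeholder.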
If you have some prior knowledge of the monomer of interest, such as a consensus sequence, you can use hmmbuild from the HMMER suite with the --singlemx option to construct a profile-HMM. This profile-HMM can be used by nhmmer to quickly identify candidate monomer locations in the whole genome. Then you can use the scripts in pipeline for monomer identification in these candidate regions.
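The two HMMER calls described above could be assembled along the following lines. This is only a sketch: the file names are placeholders, the --dna and --cpu options and the choice to discard the main nhmmer report with -o /dev/null are my assumptions about a sensible setup, and the helper function names are hypothetical. The functions only build the argument lists; they do not run anything.

```python
import shlex


def hmmbuild_cmd(consensus_fasta, hmm_out):
    """Argument list for: hmmbuild [options] <hmmfile_out> <msafile>.

    --singlemx is the option mentioned in the thread; --dna is an assumption
    added to force the nucleotide alphabet for a short consensus sequence.
    """
    return ["hmmbuild", "--dna", "--singlemx", hmm_out, consensus_fasta]


def nhmmer_cmd(hmm_file, genome_fasta, tblout, cpus=4):
    """Argument list for: nhmmer [options] <query hmmfile> <target seqfile>.

    --tblout writes one line per hit; -o /dev/null drops the long report.
    """
    return [
        "nhmmer",
        "--cpu", str(cpus),
        "--tblout", tblout,
        "-o", "/dev/null",
        hmm_file,
        genome_fasta,
    ]


if __name__ == "__main__":
    # Placeholder file names, shown as shell-ready strings.
    print(shlex.join(hmmbuild_cmd("consensus.fa", "monomer.hmm")))
    print(shlex.join(nhmmer_cmd("monomer.hmm", "mm10.fa", "hits.tbl")))
```

To actually run a command, each list can be passed to `subprocess.run(cmd, check=True)`, provided HMMER is installed and on the PATH.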
The instructions in this Readme file are written for this scenario. Once you have the candidate regions produced by nhmmer, you can follow Steps 3 and 4 to generate a refined profile-HMM that can be used for sequence classification.
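Before following those steps, the candidate regions usually need to be cut out of the genome as separate, much smaller sequences. Below is a minimal Python sketch of that extraction. It assumes regions are given as (sequence name, start, end) tuples with 1-based inclusive coordinates; the function names and the "name:start-end" record naming are hypothetical, and the input format the pipeline scripts actually expect should be checked against the Readme.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs; name is the first word of the header."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name = (line[1:].split() or [""])[0]
            chunks = []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def extract_regions(fasta_lines, regions):
    """Return [(record_name, subsequence)] for 1-based inclusive regions.

    Region ends beyond the sequence length are clamped to the last base.
    Only one sequence (e.g. one chromosome) is held in memory at a time.
    """
    wanted = {}
    for target, start, end in regions:
        wanted.setdefault(target, []).append((start, end))

    records = []
    for name, seq in read_fasta(fasta_lines):
        for start, end in wanted.get(name, []):
            end = min(end, len(seq))
            records.append((f"{name}:{start}-{end}", seq[start - 1:end]))
    return records
```

Each returned pair can then be written out as a ">name" header line followed by the sequence, giving a small FASTA file per candidate region or one combined file.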
Hope this helps and please let us know if you have any questions!
Thank you very much for the fast reply. The process is much clearer to me now.
I'm currently running nhmmer. As soon as it's finished, I will use construct and scan on the candidate regions.
Thanks again and also thanks to Andrew.
Have a nice day.
Hello,
First, thank you for your work. I would like to test your tool on other sub-families of transposable elements, so I'm trying to follow the instructions, but I have some questions.
I have a single multi-FASTA file with the mm10 genome, separate FASTA files for the individual chromosomes, and a MUSCLE alignment of my list of transposable elements in ClustalW format.
So I ran the construct tool first. Then, when I tried scan with the complete genome (or just one chromosome), the program was killed. Checking the memory, I saw that it reached more than 200 GB of RAM. When I tested with chrF.fa and train.aln, the program worked well. So I think the FASTA files are too big for the program (1.18 GB for the complete genome, 60-200 MB for the individual chromosomes). Is there a way to run it on the whole genome?
Also, do the scripts in the pipeline directory have to be run after construct and scan, or are they a totally separate part?
The exact commands that I used are (here with the toy_test files):
construct -o train.construct.model -u -v train.aln
scan -o train.scan.bed -v -c chrF.fa train.construct.model
Thank you for your time and your work.
Best.
Pierre-Emmanuel