nskvir / RepEnrich

RepEnrich is a method to estimate repetitive element enrichment using high-throughput sequencing data.
26 stars 14 forks source link

Multiple basenames for index #10

Closed rtmag closed 8 years ago

rtmag commented 8 years ago

Hello Developers!

I'm quite keen on using RepEnrich in my analysis. I already created the index using

python ../RepEnrich_setup.py hg19_repeatmasker_clean.txt hg19.fa ./hg19

this generated many indexes inside the hg19 directory that look like :

UCON29.1.ebwt UCON29.2.ebwt UCON29.3.ebwt UCON29.4.ebwt UCON29.fa UCON29.rev.1.ebwt UCON29.rev.2.ebwt X2_LINE.1.ebwt X2_LINE.2.ebwt X2_LINE.3.ebwt X2_LINE.4.ebwt X2_LINE.fa X2_LINE.rev.1.ebwt X2_LINE.rev.2.ebwt X3_LINE.1.ebwt X3_LINE.2.ebwt X3_LINE.3.ebwt X3_LINE.4.ebwt X3_LINE.fa X3_LINE.rev.1.ebwt X3_LINE.rev.2.ebwt X5B_LINE.1.ebwt X5B_LINE.2.ebwt X5B_LINE.3.ebwt X5B_LINE.4.ebwt X5B_LINE.fa X5B_LINE.rev.1.ebwt X5B_LINE.rev.2.ebwt

When I try to run: bowtie /home/rtm/RepEnrich/genomes/hg19 \ -p 6 -t -m 1 -S \ --max /home/rtm/RepEnrich/DMSO_multimap.fastq \ -1 /home/rtm/fastq/HCT_RNA-Seq_data/raw_data/DMSO_HWN2YCCXX_L2_1.fq.gz \ -2 /home/rtm/fastq/HCT_RNA-Seq_data/raw_data/DMSO_HWN2YCCXX_L2_2.fq.gz \ /home/rtm/RepEnrich/DMSO_unique.sam

It raises the error: Could not locate a Bowtie index corresponding to basename "/home/rtm/RepEnrich/genomes/hg19"

If I use as an index one of the transposon families such as /home/rtm/RepEnrich/genomes/hg19/X2_LINE works fine, then the error is because I am not using any of the basename indexes but in the example only the directory containing the multiple indexes is specified:

bowtie /data/mm9 -p 16 -t -m 1 -S --max /data/sampleA_multimap.fastq -1 sample_A_1.fastq -2 sample_A_2.fastq /data/sampleA_unique.sam

any advice?

Thanks a lot in advance!

Best RTM

scriscio commented 8 years ago

Hi,

First try the pre-configured setup available here: https://drive.google.com/folderview?id=0B1dD8MQRH4qZfmlxOGwtSXRnWDFaVldqbkExdXItZGpySm1mVmhlTVladThHWWhGMmxrLTQ&usp=sharing

Then let me know if you continue to have issues. The error you mention indicates that the genome you are using does not have a bowtie to indices not the repeat files.

-Steven

rtmag commented 8 years ago

Dear Steven, Thank you very much for your advice on this issue, however, the error persists.

I just tried a run using the pre-configured setup for hg19: bowtie /home/rtm/myPrograms/RepEnrich/genomes/RepEnrich_setup_hg19 -p 6 -t -m 1 -S --max HCT_DMSO_multimap.fastq -1 HCT_DMSO_HWN2YCCXX_L2_1.fq.gz -2 HCT_DMSO_HWN2YCCXX_L2_2.fq.gz HCT_DMSO_unique.sam

This produces the same error as before: Could not locate a Bowtie index corresponding to basename "/home/rtm/myPrograms/RepEnrich/genomes/RepEnrich_setup_hg19" Overall time: [00:00:00]

Am I suppose to write something as RepEnrich_setup_hg19/* or something alike? Thanks again for the help in advance!

Best, RTM

scriscio commented 8 years ago

Hi,

RepEnrich_setup_hg19 contains the setup for the repetitive element genomes. The first alignment step is an alignment to the genome (hg19.fa). You would need to download the human genome (hg19.fa) and index the human genome with bowtie. Then the bowtie command require the index name for hg19. For more instructions on how to do this check the bowtie1 manual http://bowtie-bio.sourceforge.net/manual.shtml. Use bowtie-build indexer to create the indices.

adomingues commented 8 years ago

Hi @scriscio, I am having the exact same problem. I prepared the repeat_masker annotations with RepEnrich_setup.py which created a folder, Zv9/Sequence/repEnrich, containing multiple bowtie indexes lacking a common prefix:

ZFERV_LTR.rev.1.ebwt ZFERV_I-int.4.ebwt U8.rev.1.ebwt (..)

This leads to RepEnrich.py crashing with the following error:

Preparing for analysis using RepEnrich... /bin/sh: module: command not found Conducting region sorting on unique mapping reads.... Identified 1922unique reads that mapped to repeats. Processing repeat psuedogenomes... Could not locate a Bowtie index corresponding to basename "Zv9/Sequence/repEnrich/HY1" Command: bowtie -k1 -p 1 --quiet Zv9/Sequence/repEnrich/HY1 results/repEnrich/testis-adult-IP.multimap.fastq

(..)

This happens with both bowtie v0.12.8 and bowtie v1.1.1. The weird thing is that RepEnrich does not appear to break and it actually produces some tables of results. These are the last lines of the error message:

Could not locate a Bowtie index corresponding to basename "/home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/ repEnrich/tRNA-Glu-GAA" Command: bowtie -k1 -p 1 --quiet /home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/repEnrich/tRNA-Glu-GAA res ults/repEnrich/test.multimap.fastq Sorting and combining the output for both read pairs.... completed ... Writing and processing intermediate files... Identified 55 reads that mapped to repeats for unique and multimappers. Conducting final calculations... Writing final output and removing intermediate files... ... Done

And the results tables are not empty:

head results/repEnrich/test_class_fraction_counts.txt

LTR 1.0 Satellite 0.0 rRNA 41.5 DNA 6.0 snRNA 12.0 Simple_repeat 2.0 tRNA 0.0 Unknown 0.0 scRNA 0.0 ARTEFACT 0.0

So I dug a little deeper and it turns out that not all indexes for each of the transposon sequences are properly generated. Taking that last element in the error message as an example .1.ebwt and .2.ebwt are missing:

ls ~/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/repEnrich/tRNA-Glu-GAA*

/home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/repEnrich/tRNA-Glu-GAA.3.ebwt /home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/repEnrich/tRNA-Glu-GAA.4.ebwt /home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/repEnrich/tRNA-Glu-GAA.fa

This is not the case for all of elements:

ls /home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/repEnrich/tRNA-Ala-GCA.*

/home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/repEnrich/tRNA-Ala-GCA.1.ebwt /home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/repEnrich/tRNA-Ala-GCA.2.ebwt /home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/repEnrich/tRNA-Ala-GCA.3.ebwt /home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/repEnrich/tRNA-Ala-GCA.4.ebwt /home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/repEnrich/tRNA-Ala-GCA.fa /home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/repEnrich/tRNA-Ala-GCA.rev.1.ebwt /home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/repEnrich/tRNA-Ala-GCA.rev.2.ebwt

So I thought this could be some I/O issue with so many indexes being generated. Following RepEnrich guidelines, the simple and low-complexity elements were removed from the repeatmasker table, and new indexes were created from this smaller annotation. Again not all indexes appear to be fully generated, but now most elements seem to be mapped to. This is the sdout:

Preparing for analysis using RepEnrich... Conducting region sorting on unique mapping reads.... Identified 77871unique reads that mapped to repeats. Processing repeat psuedogenomes... Sorting and combining the output for both read pairs.... completed ... Writing and processing intermediate files... Identified 830573 reads that mapped to repeats for unique and multimappers. Conducting final calculations... Writing final output and removing intermediate files... ... Done

the error message:

/bin/sh: module: command not found Could not locate a Bowtie index corresponding to basename "/home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/repEnrichNoSimpleLow/HY1" Command: bowtie -k1 -p 1 --quiet /home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/repEnrichNoSimpleLow/HY1 results/repEnrich/gtsf1del4-het-testis-adult-input.multimap.fastq Could not locate a Bowtie index corresponding to basename "/home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/repEnrichNoSimpleLow/tRNA-Glu-GAA" Command: bowtie -k1 -p 1 --quiet /home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv9/Sequence/repEnrichNoSimpleLow/tRNA-Glu-GAA results/repEnrich/gtsf1del4-het-testis-adult-input.multimap.fastq

and the results:

LTR 9780.52561934 Satellite 3103.93073918 rRNA 673865.39465 DNA 42525.6480314 snRNA 147696.004193 tRNA 8807.4439418 Unknown 827.142333331 scRNA 0.0 ARTEFACT 0.0 DNA? 28.5566195863 Satellite? 55.1419371509 RC 1676.27647326 LINE 12059.9791078 SINE 8017.95635457

Now only HY1 and tRNA-Glu-GAA are missing, something I can live with for now, but it would be nice to know what is going on.

scriscio commented 8 years ago

Hi,

Thank you for the detailed information. This seems like a complex bug I will have to troubleshoot. I have not tried to do the setup for Danio_rerio. This does seem slightly different the issue which indicated the absence of a bowtie1 index for the genome fasta.

Can you send me the Repeatmasker file you prepared to do the setup ?  My email is steven.criscione@gmail.com.  If it is large you could send it over google drive or dropbox link.

-Steven

scriscio commented 8 years ago

Hi,

 We found that the issue was being cause by unassigned contigs included in the RepeatMasker annotation file, while trying to use a genome version which did not include unassigned contigs.  The solution was to use the full genome build for Danio_rerio, which included the unassigned contigs when conducting the setup.

-Steven