Analysis of zebrafish genome

joseph-siefert commented 7 years ago

I am trying to use RepEnrich to look at repetitive elements in the zebrafish genome and I am having some issues. I am getting the following error when I run RepEnrich.py, for just about every repeat type in the genome:

Could not locate a Bowtie index corresponding to basename "/Volumes/sansam_lab/Joe/setup_folder_GRCz10_full_rmsk/tRNA-Val-GTG" Command: bowtie -k1 -p 1 --quiet /Volumes/sansam_lab/Joe/setup_folder_GRCz10_full_rmsk/tRNA-Val-GTG SRR870896_multimap.fastq

This seems to be an error in the setup folder. It appears that the program is looking for a repeat for which an index was not created. The setup did not seem to work on a tab delimited rmsk fasta file from UCSC, so I created a custom bed file of the appropriate rows. This seemed to run fine, but there were also some strange errors that occurred, such as:

Command: bowtie-build -f /net/flotsam.nfs/ifs/groups/sansam_lab/Joe/setup_folder_GRCz10_full_rmsk/hAT-N58B_DR.fa /Volumes/sansam_lab/Joe/setup_folder_GRCz10_full_rmsk/hAT-N58B_DR processing repgenome _ATATGCACAT_n.fa (11129 of 11133) -------> 1 fragments Unrecognised Chromosome: chr9 saving repgenome _ATATGCACAT_n.fa (11129 of 11133) indexing repgenome _ATATGCACAT_n.fa (11129 of 11133) Settings: Output files: "/Volumes/sansam_lab/Joe/setup_folder_GRCz10_full_rmsk/_ATATGCACAT_n..ebwt" Line rate: 6 (line is 64 bytes) Lines per side: 1 (side is 64 bytes) Offset rate: 5 (one in 32) FTable chars: 10 Strings: unpacked Max bucket size: default Max bucket size, sqrt multiplier: default Max bucket size, len divisor: 4 Difference-cover sample period: 1024 Endianness: little Actual local endianness: little Sanity checking: disabled Assertions: disabled Random seed: 0 Sizeofs: void:8, int:4, long:8, size_t:8 Input files DNA, FASTA: /net/flotsam.nfs/ifs/groups/sansam_lab/Joe/setup_folder_GRCz10_full_rmsk/_ATATGCACAT_n.fa Reading reference sizes Warning: Encountered empty reference sequence Error: No unambiguous stretches of characters in the input. Aborting... Time reading reference sizes: 00:00:00 Total time for call to driver() for forward index: 00:00:00

There were many, many errors such as this. It's not clear to me how to ensure the setup was compiled correctly. There doesn't seem to be any issues with the mapping with bowtie. The sam and fastq files appear to be good as far as I can tell.

Any help would be much appreciated

nskvir commented 7 years ago

Hi, sorry for the late reply - it seems that notifications aren't automatically being sent to me when issues are submitted for some reason.

The original version of RepEnrich was written to work exclusively with bowtie (as opposed to bowtie2 or newer aligners) - are you still using a bowtie2-indexed genome or bowtie2 for any other components of the analysis?

If not, (and if the '--is_bed TRUE' option was utilized in the command line) this seems like there may be an issue with the annotation file potentially... the chr values are derived from iterating over the column elements in the bedfile, so "Unrecognized Chromosome: chr9" looks like it may be trying to refer to an element whose attributes differ between your bedfile and reference genome, although it is difficult to say without looking at the input files themselves.

We are actually preparing to release RepEnrich2, which will support bowtie2, fairly soon (likely in the coming month) so if the issue persists you may wish to try again using the new version.

joseph-siefert commented 7 years ago

Hi Nicholas,

Thank you for your response. I had to go with the custom because I was unable to get the setup working with the full rmsk file. I think the problem with the rmsk file was it being tab delimited. The bed file is also tab delimited. IS this the proper format? Otherwise the columns and chromosome names should be correct. Unless you can spot something I don't see:

head rmsk_GRCz10_custom.bed

chr1 16773931 16780499 Gypsy51-I_DR LTR Gypsy

chr1 41943017 41943117 DNA-8-14_DR DNA hAT-Charlie

chr1 50331621 50331813 ANGEL DNA hAT

chr1 3145481 3146012 DNA-3-2_DR DNA DNA

chr1 5242647 5243431 DNA2-2_DR DNA DNA

chr1 9436873 9437384 TDR18 DNA TcMar

chr1 10485653 10485795 TDR3 DNA hAT-Charlie

chr1 10485722 10485811 Helitron-2_DR RC Helitron

chr1 10485738 10485827 TE-X-5_DR Satellite Satellite

chr1 13629989 13633992 Mariner-N2_DR DNA TcMar

head GRCz10.fasta

CM002885.1 Danio rerio chromosome 1, GRCz10 reference primary assembly

GATCTTAAACATTTATTCCCCCTGCAAACATTTTCAATCATTACATTGTCATTTCCCCTCCAAATTAAATTTAGCCAGAG

GCGCACAACATACGACCTCTAAAAAAGGTGCTGTAACATGTACCTATATGCAGCACCACTATATGAGAGCGGCATAGCAG

TGTTTAGTCACTTGGTTGCTTTGTTTATATTAACTTGAAAGTGTGTTTTAGCTATTGAGTTTAAACAAAGGGAGCGGTTT

ACATTGAATTAAAGGCAACTACTGATGGGTTGTGTAATGTTTCAAAGAGCTGTTGCAGCATGAGTGGAAAATAAAACCGT

ATTAGTGCTGCCTGGCCCAGTTTGGCACAAAATGGAGCGATTCCATTAAGAGAACGATTCAGCATAAGTGGAACAGCTAA

AGTTTATGAAAATTTTTAATCTGGATGTAGAGAATCTCATAACACAGAAACAGCACTCCTAAAGATGCATTTATACTTCT

GCATAGAGCACACAAGTATGCTTCAGCACAACCTGTGCATGGTCACATAGCCCTTGCTGTGGCAGTCAGACAATGATGCA

CACCTCAATGAATAGAGCAAGCTGTGATTGGTCAGCTTTGTAGAGCTGACGTGTGCCCTAACCTAGTCCTCATCTTCAGT

CAGTAATTTGGGGGCCATGTCTGACAGCAACCTGTGTAAGGCCAGATGAAGCATCTGTAAAACTCAATTTCTTCATCGTA

Thanks so much for your help. I will likely want to do this with the new version once it comes out, but I'd like to try and sort out as much as I can in the meantime.

Best,

Joe

On Fri, Apr 14, 2017 at 5:27 PM, Nicholas Skvir notifications@github.com wrote:

Hi, sorry for the late reply - it seems that notifications aren't automatically being sent to me when issues are submitted for some reason.

The original version of RepEnrich was written to work exclusively with bowtie (as opposed to bowtie2 or newer aligners) - are you still using a bowtie2-indexed genome or bowtie2 for any other components of the analysis?

If not, (and if the '--is_bed TRUE' option was utilized in the command line) this seems like there may be an issue with the annotation file potentially... the chr values are derived from iterating over the column elements in the bedfile, so "Unrecognized Chromosome: chr9" looks like it may be trying to refer to an element whose attributes differ between your bedfile and reference genome, although it is difficult to say without looking at the input files themselves.

We are actually preparing to release RepEnrich2, which will support bowtie2, fairly soon (likely in the coming month) so if the issue persists you may wish to try again using the new version.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/nskvir/RepEnrich/issues/16#issuecomment-294248425, or mute the thread https://github.com/notifications/unsubscribe-auth/AZfGiTwwnYb9_glSjvdHRFdxNSBE3O7Wks5rv_LagaJpZM4Mqu0p .

nskvir commented 7 years ago

No problem at all!

So, based on what I can see from the headers it looks like you're doing things correctly - the script is indeed meant to take tab-delimited files, so that shouldn't be a problem. Out of curiosity, how did you generate your custom bedfile? If it was done manually, using excel or something similar for instance, these programs can sometimes have unexpected automatically-applied effects on the formatting and may not be immediately easy to spot.

If you'd like, feel free to email me ( nicholas_skvir@brown.edu ) or share links to the bedfile/fasta reference and I can try to run the setup on our end to see if I fare any better.

Best, Nick

joseph-siefert commented 7 years ago

Hi Nick,

I generated the custom file using awk from the full rmsk file that I downloaded from UCSC. I would send you the files, but I am out of the country and I'm having trouble downloading them from the server. I will try when I get back.

Thanks! Joe

On Tue, Apr 18, 2017 at 12:15 AM, Nicholas Skvir notifications@github.com wrote:

No problem at all!

So, based on what I can see from the headers it looks like you're doing things correctly - the script is indeed meant to take tab-delimited files, so that shouldn't be a problem. Out of curiosity, how did you generate your custom bedfile? If it was done manually, using excel or something similar for instance, these programs can sometimes have unexpected automatically-applied effects on the formatting and may not be immediately easy to spot.

If you'd like, feel free to email me ( nicholas_skvir@brown.edu ) the bedfile/fasta reference and I can try to run the setup on our end to see if I fare any better.

Best, Nick

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/nskvir/RepEnrich/issues/16#issuecomment-294518935, or mute the thread https://github.com/notifications/unsubscribe-auth/AZfGia-wJ2ggvqSjS2gxGEVjfKr1kk3_ks5rw5A9gaJpZM4Mqu0p .

adomingues commented 7 years ago

Hi @joseph-siefert,

are you still having problems? I generated the annotations for zebrafish recently and also ran into some problems. This has been fixed now, so if you want I can share my notes with you here. I did it for Zv9 but will do it next week for danRer10 as well.

joseph-siefert commented 7 years ago

I've been out of the country so I haven't had a chance to work on it. I will try as soon as I can and let you know. Might be a few weeks before I can get to it. Thanks!

On Apr 28, 2017, at 4:39 PM, A. Domingues notifications@github.com wrote:

Hi @joseph-siefert,

are you still having problems? I generated the annotations for zebrafish recently and also ran into some problems. This has been fixed now, so if you want I can share my notes with you here. I did it for Zv9 but will do it next week for danRer10 as well.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

joseph-siefert commented 7 years ago

Hi @adomingues

I'm still having issues, if you wouldn't mind sharing your notes with me that would be very helpful.

Thanks!

adomingues commented 7 years ago

Hi @joseph-siefert ,

I am on holidays at the moment, so I can't reply until later next week. I will do it when I am back though.

Just to give you the heads up.

adomingues commented 7 years ago

Hi @joseph-siefert ,

Here is how I prepared my annotations.

ge="danRer10"
ref_folder="/home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv10/Sequence/repeatMasker"
mkdir -p $ref_folder; 
cd $ref_folder

wget http://www.repeatmasker.org/genomes/danRer10/RepeatMasker-rm406-dfam2.0/danRer10.fa.out.gz
gunzip ${ge}.fa.out.gz
repseq="${ref_folder}/${ge}.fa.out"

# remove low complexity repeats
egrep -v 'Simple_repeat|Low_complexity' $repseq > ${ge}.noSimpleLow.fa.out

BIN='/fsimb/groups/imb-kettinggr/common_bin'
source /fsimb/groups/imb-kettinggr/common_bin/repEnrich/bin/activate
PATH=/fsimb/groups/imb-kettinggr/common_bin/BEDTools/2.23.0/bin:$PATH
export PATH
PATH=/fsimb/groups/imb-kettinggr/common_bin/bowtie/0.12.8:$PATH
export PATH
bowtie --version
# bowtie version 0.12.8

fasta="/home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv10/Sequence/WholeGenomeFastaTopLevel/genome.fa"
outref="/home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv10/Sequence/repEnrichNoSimpleLow"
repseq="/home/adomingu/imb-kettinggr/genomes/Danio_rerio/Ensembl/Zv10/Sequence/repeatMasker/${ge}.noSimpleLow.fa.out"
mkdir -p logs
bsub -q long -app Reserve6G -n1 -J IndexRepEnrich -e logs/IndexRepEnrich.err.log -o logs/IndexRepEnrich.out.log python ${BIN}/RepEnrich/RepEnrich_setup.py $repseq ${fasta} ${outref} --flankinglength 50
## takes a while.

A few notes:

Download sequences from repeatmasker
For me it worked when it contained all scaffolds, that is, not just Chr contigs, but also ‘chrUn_*.
I removed simple repeats because teh index generation was taking too long (and breaking I think), so not worth it for the biological insight we gain by having them.
after the index generation, I just ran repEnrich as in tutorial.

I hope this helps.

joseph-siefert commented 7 years ago

Hi @adomingues,

Thank you very much for sharing. I finally made some progress on this. Turns out I had an issue with the genome indexing that was messing everything up.

Now I have a new issue though that maybe you can help me with. The setup and mapping seem to have run fine, but when I run the RepEnrich.py I get the following error:

IOError: [Errno 2] No such file or directory: '/setup/repgenomes_key.txt'

The repgenomes_key.txt file does not seem to have been created during the setup. Any advice on this?

I really appreciate your help.

Best, Joe

adomingues commented 7 years ago

Hi @joseph-siefert , No problem. Regarding the new issue, no idea how to solve it because I did not run into that problem. It could just some confusion with the paths, and the file got generated elsewhere. Maybe @nskvir can help.

Good luck.

nskvir commented 7 years ago

Hey @joseph-siefert

If there were no errors thrown during the setup steps I'm inclined to agree that your new issue is perhaps due to some mixup with the path specification -- if you check your setup folder (within the annotation folder) it should contain the repgenomes_key.txt file and repnames.bed file as well as all of the pseudogenome .ebwt/.fa extension files.

Best, Nick

nskvir / RepEnrich

Analysis of zebrafish genome #16