hyunhwan-jeong / SalmonTE

SalmonTE is an ultra-Fast and Scalable Quantification Pipeline of Transposable Element (TE) Abundances
GNU General Public License v3.0

Input own reference file? #20

Closed oronoc1210 closed 5 years ago

oronoc1210 commented 5 years ago

Hello,

I'm looking to quantify TE expression in Sorghum bicolor, and am currently using TEtranscripts. SalmonTE sounds like a very promising alternative, but I was disappointed to find that it only supports Homo sapiens, Mus musculus, Drosophila melanogaster, and Danio rerio. Is it completely impossible to add functionality for inputting your own reference, provided it is in the correct format? I currently have a genome assembly fasta, TE annotation fasta and gff3, and gene annotation fasta and gff3, so would it not be possible to make a reference file myself from these files and use that as the input instead of one of the four currently accepted references?

Best, Conor

hyunhwan-jeong commented 5 years ago

Hello @oronoc1210,

Thank you for your interest in SalmonTE, and I am sorry to hear you are having an issue with it. I should mention that SalmonTE does not need the genome sequence or the GFF3. It only needs the TE annotation (the TE sequences), and there is a wiki page that shows how to create a customized index: https://github.com/hyunhwaj/SalmonTE/wiki/How-to-build-a-customized-index

In other words, you can create your index from the TE annotation (if it really is a FASTA of TE sequences) by following those steps. If you think that is too much work, I will be glad to add the index to SalmonTE. Please let me know if you would like that.

Best,

Hyun-Hwan Jeong

oronoc1210 commented 5 years ago

Dear Hyun-Hwan Jeong,

Could you give some input as to whether my TE annotation fasta is formatted correctly? Looking at the wiki page, you want headers to be in the form ">name family/class", with name and family/class being tab-separated. My TE fasta headers look like this: ">EnSpm-N1_SB DNA/CMC-EnSpm", ">Os4_05_6L LTR/Os4_05_6L", ">L1-7_SBi LINE/L1", etc.
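The header format described above can be checked before building an index. The following is a minimal sketch, not part of SalmonTE: `parse_header` and `malformed_headers` are hypothetical helper names, and the rule of exactly two non-empty tab-separated fields is an assumption taken from the description in this thread rather than from the SalmonTE source.

```python
def parse_header(line):
    """Split a FASTA header of the assumed form '>name<TAB>family/class'."""
    if not line.startswith(">"):
        raise ValueError("not a FASTA header: %r" % line)
    # Drop the leading '>' and the trailing newline, then split on tabs.
    fields = line[1:].rstrip("\r\n").split("\t")
    if len(fields) != 2 or not all(fields):
        raise ValueError("expected 2 tab-separated fields, got %d" % len(fields))
    return fields[0], fields[1]


def malformed_headers(fasta_lines):
    """Return (line number, header) pairs for headers that fail parse_header."""
    bad = []
    for lineno, line in enumerate(fasta_lines, 1):
        if line.startswith(">"):
            try:
                parse_header(line)
            except ValueError:
                bad.append((lineno, line.rstrip("\r\n")))
    return bad
```

Passing an open file handle to `malformed_headers` returns an empty list when every header matches the assumed format. A header separated by a space instead of a tab is reported, which is an easy mistake to miss by eye.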

When I try to build the index, I get the following error:

2018-10-05 20:28:32,797 Building Salmon Index
Traceback (most recent call last):
  File "/usr/local/SalmonTE/SalmonTE.py", line 276, in <module>
    run(args)
  File "/usr/local/SalmonTE/SalmonTE.py", line 240, in run
    build_salmon_index(args['--input_fasta'], args['--ref_name'], args['--te_only'])
  File "/usr/local/SalmonTE/SalmonTE.py", line 194, in build_salmon_index
    name, anno = line[1:].strip().split("\t")[:-1]
ValueError: not enough values to unpack (expected 2, got 1)

line[1:].strip().split("\t")[:-1] appears to take, for example, ">EnSpm-N1_SB DNA/CMC-EnSpm" and then strip and split it into ["EnSpm-N1_SB", "DNA/CMC-EnSpm"], but the [:-1] removes the "DNA/CMC-EnSpm", giving the ValueError when trying to assign name, anno with just "EnSpm-N1_SB".
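The failure described above can be reproduced in isolation. This is a small sketch using one of the headers from this thread, with the tab written explicitly as `\t`. The final line shows one possible fix (keeping the first two fields); it is an assumption for illustration and not necessarily what the maintainer's commit changed.

```python
header = ">EnSpm-N1_SB\tDNA/CMC-EnSpm\n"

# Same expression as line 194, without the trailing slice.
fields = header[1:].strip().split("\t")
# fields is now ['EnSpm-N1_SB', 'DNA/CMC-EnSpm']

# The [:-1] slice drops the last field, so only one value is left to unpack.
try:
    name, anno = fields[:-1]
except ValueError as exc:
    error = str(exc)

# One possible fix (an assumption): keep the first two fields instead.
name, anno = fields[:2]
```

With a two-field header, `fields[:-1]` has a single element, so the tuple unpacking cannot succeed. The slice would only work for headers with three tab-separated fields.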

How should I be formatting my fasta headers so that index generation works as intended?

Best, Conor

hyunhwan-jeong commented 5 years ago

Hello @oronoc1210 ,

Sorry for the inconvenience. Your formatting is right; there was a silly mistake in line 194, which has been fixed in a31dabe82342b96276084606b0e2775145a03c7b. Can you try again? Please pull the repository before you run the command.

Thank you,

Hyun-Hwan Jeong

oronoc1210 commented 5 years ago

Dear Hyun-Hwan jeong,

The correction has solved this issue, but now I'm getting another issue down the line when the salmon index is being built from my fasta file within SalmonTE.

It appears to be passing something different than my fasta file as the argument for salmon index building, as I can successfully build a salmon index by just running salmon index -t Sbicolor_313v31_repeatmasked_assembly_v30_SalmonTEformat.fa -i SalmonTest. Results:

Replaced 226877 non-ATCG nucleotides
Clipped poly-A tails from 122 transcripts
Building rank-select dictionary and saving to disk done
Elapsed time: 0.0199624s
Writing sequence data to file . . . done
Elapsed time: 0.19573s
[info] Building 32-bit suffix array (length of generalized text is 473692680)
Building suffix array . . .

For context: my goal is to construct a docker image of SalmonTE with my Sorghum bicolor index already built, for portability so that everyone at my lab can easily use Sbicolor as the index right out of the box. Regardless, the issue is when SalmonTE tries to build the Salmon Index from the fasta I copy over into the docker image. Nothing went wrong copying it over, it looks exactly the same as the fasta file that successfully built a salmon index independently above. Might there be an issue in build_salmon_index() ?

The relevant docker build steps and their results:

Step 13/19 : COPY Sbicolor_313v31_repeatmasked_assembly_v30_SalmonTEformat.fa .
 ---> Using cache
 ---> 8405d708b078
Step 14/19 : RUN head -n 20 Sbicolor_313v31_repeatmasked_assembly_v30_SalmonTEformat.fa
 ---> Running in f48e67ab1759
>EnSpm-N1_SB DNA/CMC-EnSpm
AGCCAGCGGTGGAACATATCTTTTATAGGCGGTTTCAATTAAAGACGCCTATGGAAAGACATTGGCTTACCCGACTAAAAAACCGCTAAATCAATTTTTACAGGCGGTTTTCTAACAAAACTTCCTATAGAAATCATATATTTCTACAGGCGGTTCTCCTAAGAAACCGCCTGTAAAAATCATATTTCTACAGGTGATTTCTTAAAAACCGCCTGTACAAATAATTTGAATTTGAATTTTTTGAGCTTTTCAAATGACCTCGTTTGAAAAAACCGTCAAAATGAAAGTTGTAGATCTTGAAAAGTTATCAAACTTTGTTTGATAATTTTTCATTTGAAATCGTCTTATCATCGAAAACTACGTTTAAATTTCTCAAATTTGAAATTCAAATTTTATAAATGACCTCGGATGGAGAAACTACCAATATAAAAGTTATAGATCTTAAAAAGTTGTAAAACTTTATAGTTGACGATTTTTCCATTTGAATCCGTTTTGGACCTAAAGTAATCAATGTATACTTGATTTAGGATAATATGTGGGGAACTAAACTCTAATATAGACACAACTAAGTGATCGGTGGAGTGGTACACGAGGCTACACGCGAGGGTGAGGTCTCAGGTTCGAATTCCACCGGCCGCGTAGCACGCGATTCTACGTGACCTGCCAGTGGGCCTTTCCCAAGATTAAAATCTTTTAATTTCTTATTTTCAAAAGCCGATTTCATATTTTCTAAAAAAATTTCCACAGGTGGTTGACATAACTGAACCGCTTTTCCACAGGAGGTTCTCAGTTACCCGCTT
>Os4_05_6L LTR/Os4_05_6L
TGGGGAGGCGGCGAGTTAGTAGCTGTAGGCGGCAAAGCAACTTTGTTACCTATTCTATTAGGAACCGCAGATTTTTTAGTCCATCACATTCACGCATTTACCATCCATGTGACTGTATTAATACTTTTGACAGGTGTTTTATTTGCTTGCAGTTCCCGTCTGATACCTGATAAAGCAAATCTTGGCTTTCGCTTCCCTTGCGCCGGACCTGGGCAAGGGGGAACATGTCAAGTATCCGCTCCGGATCATGTTTTCTTGGGTCTATT
>hAT-N18_SBi DNA/hAT-Tip100
GGTGTAGTCATAAATGGTACTGAACGTTGTTCGTACTGCTCAGTATTTTTTCTGCTTAGTTTTGACTTACACCTAGGGTGTAGAATA
>EnSpm-N13_SBi DNA/CMC-EnSpm
GACATGTAAATTTTGTGAACAATAATGTTGCCACCACTTTATCGGATGAAGAAATGTCCAAAACAAAAGTTATAGATCTTGATAAGTTCTAAAAACCTCTATGTTCATGACTTTTTCAGCTGAAACCATTTAGTGTTCTAAAACGATGTTGAAGTTTCTATTTTTGAAATTCAAAATTCAAATAGATAAAACAAAATCACATATAATGATGGATAAAATAATAGATTTAAGAACATAACAAAATTGCTAGAGCATGATTTTAGATTTTATAGAAAGAATCATTAAATTTGAAGTTAGTATGCAGAAGAAAAACTAGTTACTAGTCTTAGCCAGAGATTAAAAACGGAAA
>Zm9L_69L LTR/Zm9L_69L
TTTAAAGTTTAATTTCAAACAATATTTTGAAGTAGTAAGTGATGTCAAATGAAAATGTTATCAATTACAAACTTTCATAATTTTTTGAGGTCTATAATTTTTATTATAGGTGTTTATCCATTCGAGATCGTTTTGAAAATTCAAATTTTAAATTTTAGATGACGAAATTAATTTTCGTTAGACAGAAGATTACAAATAAAAA
>EnSpm-N12_SBi DNA/CMC-EnSpm
TAAGTGATGTCAAATGAAAATGTTATCAATTACAAACTTTCATAATTTTTTGAGGTCTATAATTTTTATTATAGGTGTTTATCCATTCGAGATCGTTTTGAAAATTCAAATTTTAAATTTTAGATGACGAAATTAATTTTCGTTAGACAGAAGATTACAAATAAAAATATTTGGAGATCCAAATGTTCTCAAATTGAAAAAATTTTGAACTTCGAAGTTGTAGATCTCGTCGAGGACTACAACTTTGATATAAAGTTTGTCTTCATTGGACATCATATG
>CLOUD DNA/MuDR
TACAAATAAAAATATTTGGAGATCCAAATGTTCTCAAATTGAAAAAATTTTGAACTTCGAAGTTGTAGATCTCGTCGAGGACTACAACTTTGATATAAAGTTTGTCTTCATTGGACATCATATGAGAAACTTAATAAATTTTCTT
>L1-7_SBi LINE/L1
TTTTTCTCGAAAACGCAGGAGAGCTGCGCTTCATTATATTAAGAAGAAAGAAAAGGGCAAGAGCCCAAGCAACGCTACA
>L1-7_SBi LINE/L1
GAACCCACACTCCCCCTGATTAAGTCTAGGATCACATTCGCGCCCGAGACAAAACAGCACAAGGCGACCACTACAAAATAAAAGACTCTTCTAGCCACCAGGGAGGAGGGCAGTCAGGAATGAAAGCCCTCGAGCCCCAGCCGTAGACCACAGGCAACGTTCCTCACAAGCCGCAATCAAAGCACGACCCAAGTTTGGAGCTGCTCCGTCAAAAACGCACCCATTGCGGTGGTTCCAAATAGTCCAAGCTCCAAGGATAATGAGTGAGTTAATTCCTTGCTTCGTCAACCCTAAACC
>L1-7_SBi LINE/L1
CTTTGGCACTCTGAACTCAGCCTATATCACTCTCCTTCCTAAAAAGTATGGTGCTGATCAGCCTAAAGATTT
Removing intermediate container f48e67ab1759
 ---> 672be8ca720e
Step 15/19 : RUN SalmonTE.py index --ref_name=Sbi --input_fasta='Sbicolor_313v31_repeatmasked_assembly_v30_SalmonTEformat.fa' --te_only
 ---> Running in 4e632e29d04b
2018-10-10 19:51:07,907 Building Salmon Index
Version Info: ### A newer version of Salmon is available. ####

The newest version, available at https://github.com/COMBINE-lab/salmon/releases contains new features, improvements, and bug fixes; please upgrade at your earliest convenience.

[2018-10-10 19:51:15.966] [jLog] [info] building index RapMap Indexer

[Step 1 of 4] : counting k-mers Elapsed time: 0.0225673s

Replaced 0 non-ATCG nucleotides
Clipped poly-A tails from 0 transcripts
Building rank-select dictionary and saving to disk done
Elapsed time: 0.0002212s
Writing sequence data to file . . . done
Elapsed time: 0.000238s
[info] Building 32-bit suffix array (length of generalized text is 0)
Building suffix array . . .
FAILURE: return code from libdivsufsort() was -1
2018-10-10 19:51:16,003 Building 'Sbi' index was finished!
Removing intermediate container 4e632e29d04b
 ---> cedb6e1d8464

The file input into salmon index seems to be completely empty, or at the very least in a completely different format such that salmon doesn't recognize any of the data as transcripts. Again, this is odd considering that running salmon index on this fasta outside of SalmonTE works correctly. Do you know what might be going on here?

Best, Conor

hyunhwan-jeong commented 5 years ago

Hello @oronoc1210,

I found that this is because your class annotations are from RepeatMasker, not Censor. As you guessed, SalmonTE creates a new FASTA file to build the SalmonTE index. If the class annotation of a TE in the FASTA file is not in the first column of https://github.com/hyunhwaj/SalmonTE/blob/master/scripts/clades_extended.csv, SalmonTE discards that TE sequence from the index build. Since none of your class annotations are in that file, SalmonTE threw out every sequence, which caused your issue. To resolve it, you can either replace your annotations or add the additional TE annotations to https://github.com/hyunhwaj/SalmonTE/blob/master/scripts/clades_extended.csv.
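To see in advance which class annotations would be discarded, the FASTA headers can be compared against the first column of the CSV. This is a rough sketch under stated assumptions: `load_known_classes` and `unknown_classes` are hypothetical helpers, not SalmonTE functions, and the comparison is assumed to be an exact string match on the second tab-separated header field. If the CSV has a header row, its first cell simply ends up in the set, which is harmless for this check.

```python
import csv


def load_known_classes(csv_file):
    """Collect the values in the first column of an open clades CSV file."""
    return {row[0].strip() for row in csv.reader(csv_file) if row}


def unknown_classes(fasta_lines, known):
    """Return the sorted class annotations in FASTA headers missing from `known`."""
    missing = set()
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumed header layout: '>name<TAB>family/class'
            fields = line[1:].strip().split("\t")
            if len(fields) >= 2 and fields[1] not in known:
                missing.add(fields[1])
    return sorted(missing)
```

A possible use is to open a local copy of clades_extended.csv, pass it to `load_known_classes`, and then call `unknown_classes` on the open FASTA file. Any annotation in the returned list would need to be renamed or added to the CSV before the index is built.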

Hope it helps you.

Hyun-Hwan Jeong

hyunhwan-jeong commented 5 years ago

Dear @oronoc1210

I am wondering whether you are still having the problem. Can you update me on it?

Many Thanks!

Hyun-Hwan Jeong

hyunhwan-jeong commented 5 years ago

I am closing this issue because it has been a while since I asked and there has been no reply.