Closed YiweiNiu closed 5 years ago
I am happy to hear you are interested in SalmonTE, @YiweiNiu.
Thanks for this nice tool! I want to try it on my RNA-seq data but the species (Rhesus) is not included in Repbase. So I want to create a FASTA file for SalmonTE.
As you noted, Repbase has no exact category for Rhesus, but I think you can use the Primates FASTA file in your case. That FASTA contains LTR5_RM, a long terminal repeat from Rhesus macaque (although it is the only Rhesus-specific repeat I found in Repbase), so you can give it a shot. Additionally, I found an article showing that the genome-wide distributions of retrotransposons (L1 and Alu) in M. mulatta and H. sapiens are similar. This is the simplest solution we can try.
I've seen the tutorial "How to build a customized index" but still don't know how to do it. I'm not familiar with RepeatModeler and Censor. It seems that RepeatModeler is for de novo repeat family identification and Censor is for classification, and neither is easy to use.
This could be a solution, but I cannot give a comprehensive explanation of how to use these tools right now: I have been busy with other projects for a while and haven't had a chance to look at them. I will update the documentation with a detailed explanation once I have had a chance to use them.
Since the RepeatMasker website and the UCSC Table Browser provide detailed repeat information (loci, name, class, family) for many species, can we begin from that? For example, extract some specific TE classes and reformat the file for SalmonTE.
This is a good suggestion, but I can only answer the question once I have figured out how RepeatMasker works. I will let you know once I have checked this.
Please let me know if you have additional questions or suggestions.
Thank you,
Hyun-Hwan Jeong
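For readers who want to try the RepeatMasker/UCSC route discussed above, the extraction half could be sketched roughly as follows. This is only an illustration, not something SalmonTE provides: it assumes a tab-separated export of the UCSC rmsk table with a header row containing the genoName, genoStart, genoEnd, strand, repName, repClass and repFamily columns, and the function names select_te_rows and to_bed_line are hypothetical. It produces BED-style loci only; turning those loci into sequences, and deciding whether per-locus sequences are a suitable SalmonTE reference, is still open in this thread.

```python
import csv
from typing import Dict, Iterable, Iterator, Set


def select_te_rows(handle: Iterable[str], wanted_classes: Set[str]) -> Iterator[Dict[str, str]]:
    """Yield rows of a tab-separated rmsk export whose repClass is wanted.

    Assumption: the first line is a header naming the UCSC rmsk columns.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if row["repClass"] in wanted_classes:
            yield row


def to_bed_line(row: Dict[str, str]) -> str:
    """Format one rmsk row as a 6-column BED line.

    The name column packs repName plus class/family so that the label
    survives later steps (this naming scheme is an arbitrary choice).
    """
    name = f'{row["repName"]}|{row["repClass"]}/{row["repFamily"]}'
    return "\t".join(
        [row["genoName"], row["genoStart"], row["genoEnd"], name, "0", row["strand"]]
    )


if __name__ == "__main__":
    # "rmsk.tsv" is a placeholder file name for the downloaded table.
    with open("rmsk.tsv", newline="") as handle:
        for row in select_te_rows(handle, {"LINE", "SINE", "LTR", "DNA"}):
            print(to_bed_line(row))
```

The class names passed in (LINE, SINE, LTR, DNA) are examples; check the repClass values that actually occur in your export before filtering.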
Thank you for your quick reply and helpful suggestions!
I'm not an expert on transposons and just want to check the expression of TEs in my RNA-seq data quickly :)
Your tool seems great, but I'm now stuck at the index-building step. I'll try your suggestions first. Thanks again!
@YiweiNiu If this doesn't need to be very accurate, then you'd better use the human reference for now. The human reference contains TEs of Primates, so you can do a quick check. I am planning to add a Rhesus reference to SalmonTE, and it will be included once I decide which of the three solutions above is sufficient. No worries about being stuck.
Best Regards,
Hyun-Hwan Jeong
That is very kind of you.
So when it doesn't need to be very accurate, I can just use an adjacent species or a higher taxon, is that right? Or does this depend on the situation (the distribution of TEs in these species)?
I guess so (with about 90% confidence), but I need to check.
Hyun-Hwan Jeong
Hi Hyun-Hwan,
Does SalmonTE only recognize the .fastq and .fastq.gz filename extensions?
$ ls example
CTRL_1_R1.fastq CTRL_1_R2.fastq CTRL_2_R1.fastq CTRL_2_R2.fastq TARDBP_1_R1.fastq TARDBP_1_R2.fastq TARDBP_2_R1.fastq TARDBP_2_R2.fastq
$ ./SalmonTE.py quant --reference=hs --outpath=res1 example
$ head -3 res1/EXPR.csv
TE,CTRL_1,CTRL_2,TARDBP_1,TARDBP_2
ALU,0.0244029,0.0,0.0,0.0
AluJb,0.348222,0.0,0.0,0.0
# after changing the filename extensions
$ ls example2/
CTRL_1_R1.fq CTRL_1_R2.fq CTRL_2_R1.fq CTRL_2_R2.fq TARDBP_1_R1.fastq TARDBP_1_R2.fastq TARDBP_2_R1.fastq TARDBP_2_R2.fastq
$ ./SalmonTE.py quant --reference=hs --outpath=res2 example2
$ head -3 res2/EXPR.csv
TE,TARDBP_1,TARDBP_2
ALU,0.0,0.0
AluJb,0.0,0.0
In the second run, the CTRL group was not processed.
Bests, Yiwei Niu
Does SalmonTE only recognize .fastq and .fastq.gz filename extension?
You're correct. This is related to #4: SalmonTE currently supports only '.fastq' and '.fastq.gz'. Please use '.fastq' or '.fastq.gz' for now, and please add your questions or comments to #4, because the file extension problem is not related to the current issue.
Thank you,
Hyun-Hwan Jeong
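Until other extensions are supported, one workaround is to rename the input files before running quant. The sketch below is an illustration only: normalize_extensions is a hypothetical helper, not part of SalmonTE, and the .fq and .fq.gz suffixes are the only alternatives it handles.

```python
from pathlib import Path
from typing import List

# Suffixes to rewrite; the longer suffix is listed first for clarity.
RENAMES = {".fq.gz": ".fastq.gz", ".fq": ".fastq"}


def normalize_extensions(directory: str) -> List[Path]:
    """Rename *.fq and *.fq.gz files in `directory` to *.fastq / *.fastq.gz.

    Returns the new paths. Refuses to overwrite an existing file.
    """
    renamed = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        for old, new in RENAMES.items():
            if path.name.endswith(old):
                target = path.with_name(path.name[: -len(old)] + new)
                if target.exists():
                    raise FileExistsError(f"{target} already exists")
                path.rename(target)
                renamed.append(target)
                break
    return renamed


if __name__ == "__main__":
    # "example2" is the directory name used in the session above.
    for new_path in normalize_extensions("example2"):
        print(new_path)
```

Renaming in place keeps the sample names (CTRL_1, CTRL_2, ...) unchanged, so the columns of EXPR.csv should match those of the first run; symlinking instead of renaming would be an alternative if the original names must be preserved.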
Sorry, I hadn't looked through the other issues.
@YiweiNiu No problem! This is fine. I will let you know when SalmonTE is ready to support those file formats!
Hi, I have the same problem: I need to create an index for a species that is not in Repbase. I have the RepeatModeler output, but the program does not recognize the FASTA file.
Would it be possible to update the documentation?
Hello @narojass,
Could you explain a bit more about that? Do you mean that SalmonTE does not accept the FASTA file produced by RepeatModeler?
Best,
Hyun-Hwan Jeong
@YiweiNiu, yes, SalmonTE says: "Input file is not to FASTA file". I think the problem must be the sequence names. How can I get the names for the hierarchy of classes?
@narojass, can you share the FASTA file you created with me?
Thanks,
Hyun-Hwan Jeong
@narojass, I did not receive a file. Could you send it directly to hyunhwaj@bcm.edu?
Thank you!
Hyun-Hwan Jeong
Hi @hyunhwaj, I encountered the same problem, "Input file is not a FASTA file", at the index step. The command I ran was "SalmonTE.py index --ref_name=TEST --input_fasta=TEST.fa"
The TEST.fa contains sequences like:
>TE_00000000#TIR/EnSpm_CACTA
GTTGAACAGTTTAGAATTTGGTCCATTTGGCAAAG....
>TE_00000002#Unknown
AAAATTACAAATAAAATCATTCAAA.....
>TE00000034#LTR/unknown
[sequences]
>TE_00000132#TIR/PIF_Harbinger
[sequences]
The index step failed with the same error when a genome assembly was used.
How can I fix this, please? Thank you.
@CeciliaDeng
I suspect the FASTA file is not properly formatted.
A sequence in the FASTA file should be written as follows:
>B1 SINE1/7SL
gccgggcatggtggcgcacgcctttaatcccagcacttgggaggcagaggcaggcggatttctgagttcg
aggccagcctggtctacanagtgagttccaggacagccagggctacacagagaaaccctgtctcg
It seems that you didn't get the FASTA file from Repbase, so you will also need to edit the reference/clades_extended.csv file accordingly.
As a result, you need to change your FASTA file to look like this:
>TE_00000000 TIR/CACTA
GTTGAACAGTTTAGAATTTGGTCCATTTGGCAAA
>TE_00000002 Unknown
AAAATTACAAATAAAATCATTCAAA.....
>TE_00000132 TIR/PIF_Harbinger
[sequences]
You also need to modify clades_extended.csv like this:
TIR/CACTA TIR/CACTA DNA transposon Transposable Elements
...
TIR/PIF_Harbinger TIR/PIF_Harbinger DNA transposon Transposable Elements
Let me know if you have further questions.
Hyun-Hwan Jeong
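For a large FASTA file, the header change described above can be scripted. The following is a minimal sketch under stated assumptions: headers look like ">ID#CLASS/FAMILY" (as in the example posted in this thread), and convert_fasta is a hypothetical helper, not part of SalmonTE. It only replaces the "#" with a space and leaves the labels unchanged (it does not shorten TIR/EnSpm_CACTA to TIR/CACTA), and it returns the distinct labels so you know which ones need rows in clades_extended.csv.

```python
from typing import Iterable, List, Tuple


def convert_header(line: str) -> Tuple[str, str]:
    """Turn '>ID#LABEL' into ('>ID LABEL', 'LABEL').

    Headers without '#' are returned unchanged with an empty label.
    """
    name, sep, label = line[1:].strip().partition("#")
    if not sep:
        return ">" + name, ""
    return f">{name} {label}", label


def convert_fasta(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Rewrite all header lines and collect the distinct labels seen."""
    out, labels = [], set()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            header, label = convert_header(line)
            if label:
                labels.add(label)
            out.append(header)
        else:
            out.append(line)
    return out, sorted(labels)


if __name__ == "__main__":
    # "TEST.fa" is the input name from the thread; the output name is a placeholder.
    with open("TEST.fa") as src:
        converted, labels = convert_fasta(src)
    with open("TEST.converted.fa", "w") as dst:
        dst.write("\n".join(converted) + "\n")
    print("Labels that need rows in clades_extended.csv:")
    for label in labels:
        print(label)
```

Whether SalmonTE imposes further requirements on the header or on clades_extended.csv is not confirmed here, so treat the output as a starting point and check it against the example format given above.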
Hi @hyunhwaj, thank you for your advice. I changed all the seqIDs accordingly. Now running SalmonTE.py index --ref_name=TEST --input_fasta=test.fa exited with
2020-11-12 10:54:11,271 Building Salmon Index
Traceback (most recent call last):
  File "/opt/SalmonTE/SalmonTE.py", line 292, in <module>
    run(args)
  File "/opt/SalmonTE/SalmonTE.py", line 248, in run
    build_salmon_index(args['--input_fasta'], args['--ref_name'], args['--te_only'])
  File "/opt/SalmonTE/SalmonTE.py", line 178, in build_salmon_index
    os.mkdir(out_path)
OSError: [Errno 30] Read-only file system: '/opt/SalmonTE/reference/GtrNCBI'
Is there an option to specify a different output dir?
SalmonTE currently doesn't have an option to change the output directory, but if you are using a server you can install SalmonTE locally in your home folder.
Thank you,
Hyun-Hwan Jeong
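As a rough illustration of that workaround: the traceback shows the index being created under the installation's reference/ directory, so the installation itself has to be writable. The sketch below checks this and, if needed, copies the installation somewhere writable. ensure_writable_copy is a hypothetical helper and the paths are examples; a fresh clone of the repository into your home folder would achieve the same thing.

```python
import os
import shutil
from pathlib import Path


def ensure_writable_copy(install_dir: str, fallback_dir: str) -> Path:
    """Return an installation directory whose reference/ folder is writable.

    If `install_dir`/reference is not writable (or missing), copy the whole
    installation to `fallback_dir` (unless it already exists) and return that.
    """
    install = Path(install_dir)
    if os.access(install / "reference", os.W_OK):
        return install
    fallback = Path(fallback_dir)
    if not fallback.exists():
        shutil.copytree(install, fallback)
    return fallback


if __name__ == "__main__":
    # /opt/SalmonTE comes from the traceback; the home-folder target is an example.
    usable = ensure_writable_copy("/opt/SalmonTE", str(Path.home() / "SalmonTE"))
    print(f"Run the index step with: {usable / 'SalmonTE.py'}")
```

Note that os.access only inspects permission bits, so the check may still pass on some read-only mounts; in that case the mkdir error in the traceback is the definitive signal, and copying or cloning to the home folder is the way out.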
Hi Hwan,
Thanks for this nice tool! I want to try it on my RNA-seq data but the species (Rhesus) is not included in Repbase. So I want to create a FASTA file for SalmonTE.
I've seen the tutorial "How to build a customized index" but still don't know how to do it. I'm not familiar with RepeatModeler and Censor. It seems that RepeatModeler is for de novo repeat family identification and Censor is for classification, and neither is easy to use.
Since the RepeatMasker website and the UCSC Table Browser provide detailed repeat information (loci, name, class, family) for many species, can we begin from that? For example, extract some specific TE classes and reformat the file for SalmonTE.
Can this work? I don't know how to reformat the file.
Any suggestions or thoughts would be welcomed.
Bests, Yiwei Niu