hyunhwan-jeong / SalmonTE

SalmonTE is an Ultra-Fast and Scalable Quantification Pipeline of Transposable Element (TE) Abundances
GNU General Public License v3.0

Build an index for a species that is not available in Repbase #9

Closed: YiweiNiu closed this issue 5 years ago

YiweiNiu commented 6 years ago

Hi Hwan,

Thanks for this nice tool! I want to try it on my RNA-seq data but the species (Rhesus) is not included in Repbase. So I want to create a FASTA file for SalmonTE.

I've seen the tutorial "How to build a customized index" but still don't know how to do it. I'm not familiar with RepeatModeler and Censor. It seems that RepeatModeler is for de novo repeat family identification and Censor is for classification, and neither is easy to use.

Since the RepeatMasker website and the UCSC Table Browser provide detailed repeat information (loci, name, class, family) for many species, can we start from that? For example, we could extract some specific TE classes and reformat the file for SalmonTE.

Can this work? I don't know how to reformat the file.

Any suggestions or thoughts would be welcomed.

Bests, Yiwei Niu

hyunhwan-jeong commented 6 years ago

I am happy to hear you are interested in SalmonTE, @YiweiNiu.

> Thanks for this nice tool! I want to try it on my RNA-seq data but the species (Rhesus) is not included in Repbase. So I want to create a FASTA file for SalmonTE.

As you noted, there is no exact category for Rhesus in Repbase, but I think you can use the Primates FASTA file in your case. That FASTA file contains LTR5_RM, a long terminal repeat from Rhesus macaque (although it is the only Rhesus-specific repeat I found in Repbase), so you can give it a shot. Additionally, I found an article showing that the genome-wide distributions of retrotransposons (L1 and Alu) in M. mulatta and H. sapiens are similar. This is the simplest solution we can try.

> I've seen the tutorial "How to build a customized index" but still don't know how to do it. I'm not familiar with RepeatModeler and Censor. It seems that RepeatModeler is for de novo repeat family identification and Censor is for classification, and neither is easy to use.

This could be a solution, but I cannot give a comprehensive explanation of how to use these tools right now: I have been busy with other projects for a while, so I haven't had a chance to look at them. I will update the documentation with a detailed explanation once I have had a chance to use those tools.

> Since the RepeatMasker website and the UCSC Table Browser provide detailed repeat information (loci, name, class, family) for many species, can we start from that? For example, we could extract some specific TE classes and reformat the file for SalmonTE.

This is a good suggestion, but I can only answer the question once I have figured out how RepeatMasker works. I will let you know once I have checked this out.

Please let me know if you have additional questions or suggestions.

Thank you,

Hyun-Hwan Jeong

YiweiNiu commented 6 years ago

Thank you for your quick reply and helpful suggestions!

I'm not an expert on transposons and just want to quickly check the expression of TEs in my RNA-seq data :)

Your tool seems great, but I'm now stuck at the index-building step. I'll try your suggestions first. Thanks again!

hyunhwan-jeong commented 6 years ago

@YiweiNiu If the results don't need to be very accurate, you'd better use the human reference for now. The human reference contains primate TEs, so you can run a quick check. I am planning to add a Rhesus reference to SalmonTE, and it will be included once I decide which of the three solutions above is sufficient. No worries about being stuck.

Best Regards,

Hyun-Hwan Jeong

YiweiNiu commented 6 years ago

That is very kind of you.

So when it doesn't need to be very accurate, I can just use a closely related species or a higher taxon, is that right? Or does this depend on the situation (the distribution of TEs in these species)?

hyunhwan-jeong commented 6 years ago

> That is very kind of you.

> So when it doesn't need to be very accurate, I can just use a closely related species or a higher taxon, is that right? Or does this depend on the situation (the distribution of TEs in these species)?

I guess so (with about 90% confidence), but I need to check.

Hyun-Hwan Jeong

YiweiNiu commented 6 years ago

Hi Hyun-Hwan,

Does SalmonTE only recognize .fastq and .fastq.gz filename extension?

$ ls example
CTRL_1_R1.fastq  CTRL_1_R2.fastq  CTRL_2_R1.fastq  CTRL_2_R2.fastq  TARDBP_1_R1.fastq  TARDBP_1_R2.fastq  TARDBP_2_R1.fastq  TARDBP_2_R2.fastq
$ ./SalmonTE.py quant --reference=hs --outpath=res1 example
$ head -3 res1/EXPR.csv 
TE,CTRL_1,CTRL_2,TARDBP_1,TARDBP_2
ALU,0.0244029,0.0,0.0,0.0
AluJb,0.348222,0.0,0.0,0.0

# after changing the filename extensions
$ ls example2/
CTRL_1_R1.fq  CTRL_1_R2.fq  CTRL_2_R1.fq  CTRL_2_R2.fq  TARDBP_1_R1.fastq  TARDBP_1_R2.fastq  TARDBP_2_R1.fastq  TARDBP_2_R2.fastq
$ ./SalmonTE.py quant --reference=hs --outpath=res2 example2
$ head -3 res2/EXPR.csv 
TE,TARDBP_1,TARDBP_2
ALU,0.0,0.0
AluJb,0.0,0.0

In the second run, the CTRL group was not processed.

Bests, Yiwei Niu

hyunhwan-jeong commented 6 years ago

> Does SalmonTE only recognize .fastq and .fastq.gz filename extension?

You're correct. This is related to #4: SalmonTE currently supports only '.fastq' and '.fastq.gz'. Please use '.fastq' or '.fastq.gz' for now, and please add your questions or comments to #4, because the file extension problem is not related to the current issue.

Thank you,

Hyun-Hwan Jeong
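
Until other extensions are supported, one workaround is to rename the read files before running `quant`. The following is only a minimal sketch of that idea: `normalized_name`, `rename_reads`, and `SUFFIX_MAP` are hypothetical helper names, not part of SalmonTE, and it assumes that `.fq` and `.fq.gz` are the only unsupported suffixes in the directory.

```python
from pathlib import Path

# Hypothetical mapping from unsupported suffixes to the ones SalmonTE accepts.
# The longer suffix is listed first so ".fq.gz" is matched before ".fq".
SUFFIX_MAP = {".fq.gz": ".fastq.gz", ".fq": ".fastq"}


def normalized_name(name: str) -> str:
    """Return `name` with a .fq/.fq.gz suffix replaced by .fastq/.fastq.gz."""
    for old, new in SUFFIX_MAP.items():
        if name.endswith(old):
            return name[: -len(old)] + new
    return name


def rename_reads(directory: str) -> list:
    """Rename every .fq/.fq.gz file in `directory`; return the new file names."""
    renamed = []
    # sorted() materializes the listing before any file is renamed.
    for path in sorted(Path(directory).iterdir()):
        new_name = normalized_name(path.name)
        if not path.is_file() or new_name == path.name:
            continue
        target = path.with_name(new_name)
        if target.exists():
            # Refuse to overwrite an existing file with the same target name.
            raise FileExistsError(f"{target} already exists")
        path.rename(target)
        renamed.append(new_name)
    return renamed
```

Calling `rename_reads("example2")` on the directory from the report above would rename the four CTRL files and leave the TARDBP files untouched.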

YiweiNiu commented 6 years ago

Sorry, I didn't view all other issues.

hyunhwan-jeong commented 6 years ago

@YiweiNiu No problem! This is fine. I will let you know when SalmonTE is ready to support other file extensions!

narojass commented 6 years ago

Hi, I have the same problem: I need to create an index for a species that is not in Repbase. I have the output of RepeatModeler, but the program does not recognize the FASTA file.

Would it be possible to update the documentation?

hyunhwan-jeong commented 6 years ago

Hello @narojass,

> Hi, I have the same problem: I need to create an index for a species that is not in Repbase. I have the output of RepeatModeler, but the program does not recognize the FASTA file.

> Would it be possible to update the documentation?

Could you explain a bit more? Do you mean that SalmonTE does not accept the FASTA file produced by RepeatModeler?

Best,

Hyun-Hwan Jeong

narojass commented 6 years ago

@YiweiNiu, yes, SalmonTE says: "Input file is not to FASTA file". I think the problem must be the sequence names. How can I get the names for the class hierarchy?

hyunhwan-jeong commented 6 years ago

@narojass, can you share the FASTA file you created with me?

Thanks,

Hyun-Hwan Jeong

hyunhwan-jeong commented 6 years ago

@narojass, I did not receive a file. Could you send it directly to hyunhwaj@bcm.edu?

Thank you!

Hyun-Hwan Jeong

CeciliaDeng commented 3 years ago

Hi @hyunhwaj, I encountered the same problem ("Input file is not a FASTA file") at the index step. The command I ran was "SalmonTE.py index --ref_name=TEST --input_fasta=TEST.fa".

The TEST.fa contains sequences like:

>TE_00000000#TIR/EnSpm_CACTA
GTTGAACAGTTTAGAATTTGGTCCATTTGGCAAAG....
>TE_00000002#Unknown
AAAATTACAAATAAAATCATTCAAA.....
>TE00000034#LTR/unknown
[sequences]
>TE_00000132#TIR/PIF_Harbinger
[sequences]

The index step failed with the same error when a genome assembly was used.

How can I fix this, please? Thank you.

hyunhwan-jeong commented 3 years ago

@CeciliaDeng

I suspect the FASTA file is not properly formatted.

  1. As I noted on https://github.com/LiuzLab/SalmonTE/wiki/How-to-build-a-customized-index, each sequence in the FASTA file should be written as follows:

>B1 SINE1/7SL
gccgggcatggtggcgcacgcctttaatcccagcacttgggaggcagaggcaggcggatttctgagttcg
aggccagcctggtctacanagtgagttccaggacagccagggctacacagagaaaccctgtctcg
  2. It seems that your FASTA file did not come from Repbase, so you will also need to edit the reference/clades_extended.csv file accordingly.

  3. As a result, you need to fix your FASTA file like this:

>TE_00000000    TIR/CACTA
GTTGAACAGTTTAGAATTTGGTCCATTTGGCAAA
>TE_00000002    Unknown
AAAATTACAAATAAAATCATTCAAA.....
>TE_00000132    TIR/PIF_Harbinger
[sequences]

and you need to modify clades_extended.csv like this:

TIR/CACTA   TIR/CACTA DNA transposon    Transposable Elements
...
TIR/PIF_Harbinger   TIR/PIF_Harbinger DNA transposon    Transposable Elements

Let me know if you have further questions.

Hyun-Hwan Jeong
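
The header change described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SalmonTE code: `reformat_header` and `reformat_fasta` are hypothetical names, the input headers are assumed to look like `>ID#CLASS`, and a tab is assumed as the separator between the sequence name and its class (the example above only shows whitespace). It does not rename classes, so `TIR/EnSpm_CACTA` stays as it is rather than becoming `TIR/CACTA`.

```python
def reformat_header(line: str) -> str:
    """Turn a '>ID#CLASS' header into '>ID<TAB>CLASS'; other lines pass through."""
    if not line.startswith(">"):
        return line
    name, sep, te_class = line[1:].strip().partition("#")
    if not sep:
        # No '#' in the header: leave it unchanged for manual review.
        return line
    return f">{name}\t{te_class}"


def reformat_fasta(lines):
    """Return (reformatted lines, sorted unique class labels found in headers)."""
    out, classes = [], set()
    for raw in lines:
        new = reformat_header(raw.rstrip("\n"))
        if new.startswith(">") and "\t" in new:
            # Everything after the first tab is treated as the class label.
            classes.add(new.split("\t", 1)[1])
        out.append(new)
    return out, sorted(classes)
```

The second return value lists the distinct class labels in the file, which are the labels that would each need a matching row in clades_extended.csv according to the advice above.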

CeciliaDeng commented 3 years ago

Hi @hyunhwaj, thank you for your advice. I changed all the seqIDs accordingly. Now running SalmonTE.py index --ref_name=TEST --input_fasta=test.fa exits with:

2020-11-12 10:54:11,271 Building Salmon Index
Traceback (most recent call last):
  File "/opt/SalmonTE/SalmonTE.py", line 292, in <module>
    run(args)
  File "/opt/SalmonTE/SalmonTE.py", line 248, in run
    build_salmon_index(args['--input_fasta'], args['--ref_name'], args['--te_only'])
  File "/opt/SalmonTE/SalmonTE.py", line 178, in build_salmon_index
    os.mkdir(out_path)
OSError: [Errno 30] Read-only file system: '/opt/SalmonTE/reference/GtrNCBI'

Is there an option to specify a different output dir?

hyunhwan-jeong commented 3 years ago

SalmonTE currently doesn't have an option to change the output directory, but if you are using a server you can install SalmonTE locally in your home folder.

Thank you,

Hyun-Hwan Jeong
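
Since the traceback shows that the index is written under the installation's own reference/ directory, one way to act on this advice is to check whether that directory is writable and, if it is not, work from a copy in a writable location. The sketch below is hypothetical: `writable_install` is not a SalmonTE function, the fallback path is whatever the user chooses, and it assumes that a plain copy of the installation directory behaves like a local install, which has not been confirmed in this thread.

```python
import os
import shutil
from pathlib import Path


def writable_install(install_dir: str, fallback_dir: str) -> Path:
    """Return a SalmonTE directory whose reference/ folder can be written to.

    If `install_dir`/reference is writable, `install_dir` is returned as-is.
    Otherwise the whole tree is copied to `fallback_dir` (a user-chosen,
    hypothetical location) unless that copy already exists.
    """
    install = Path(install_dir)
    if os.access(install / "reference", os.W_OK):
        return install
    fallback = Path(fallback_dir)
    if not fallback.exists():
        # Assumption: copying the tree is enough to get a usable local install.
        shutil.copytree(install, fallback)
    return fallback
```

For the case above this could be called as `writable_install("/opt/SalmonTE", os.path.expanduser("~/SalmonTE"))`, after which SalmonTE.py would be run from the returned directory.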