hyunhwan-jeong / SalmonTE

SalmonTE is an ultra-fast and scalable quantification pipeline of Transposable Element (TE) abundances
GNU General Public License v3.0

issues with index output #54

Closed by ncnlll 3 years ago

ncnlll commented 3 years ago

Hi Hyun-Hwan Jeong,

I've tried using SalmonTE on some penguin data from our lab, but I'm having issues with the index step. I created my FASTA file of repeat sequences (EmperorTranscrNew_families.txt) with RepeatModeler and modified the headers as you suggested in "How to build a customized index". After that I made my own clades_extended.csv like this:

"DNA/hAT-Ac","DNA/hAT-Ac","DNA transposon","Transposable Element"
"DNA/Kolobok-H","DNA/Kolobok-H","DNA transposon","Transposable Element"
"DNA/PIF-Harbinger","DNA/PIF-Harbinger","DNA transposon","Transposable Element"
"LTR/Pao","LTR/Pao","LTR Retrotransposon","Transposable Element"
"LTR/Gypsy","LTR/Gypsy","LTR Retrotransposon","Transposable Element"
"LTR/ERVL","LTR/ERVL","Endogenous Retrovirus","Transposable Element"
"LTR/ERV1","LTR/ERV1","Endogenous Retrovirus","Transposable Element"
"LINE/CR1","LINE/CR1","Non-LTR Retrotransposon","Transposable Element"
"SINE/MIR","SINE/MIR","Non-LTR Retrotransposon","Transposable Element"

I ran the command:

python3.6 SalmonTE.py index --input_fasta=/cluster_data/home/genomic/penguins/repeats/OutRepeatModeler/EmperorTranscrNew_families.fa --ref_name=emp --te_only

but the output clades.csv file in reference/emp looks like this:

name,class,clade
rnd-1_family-19,other,other
rnd-1_family-4,other,other
rnd-1_family-0,other,other
rnd-1_family-1,other,other
rnd-1_family-10,other,other
rnd-1_family-16,other,other
rnd-1_family-8,other,other
rnd-1_family-3,other,other
rnd-1_family-5,other,other
rnd-1_family-7,other,other
rnd-1_family-18,other,other
rnd-1_family-6,other,other
rnd-1_family-17,other,other
rnd-1_family-15,other,other
rnd-1_family-13,other,other
rnd-1_family-12,other,other
...

I don't understand why I get just "other" for all my repeat sequences. Do you have any idea what the problem could be?

Your help would be much appreciated. Thank you so much.

Best, Lorena
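
For anyone hitting the same symptom, a quick way to confirm it is to tally the class column of the generated clades.csv. The following is only a minimal sketch: it assumes the three-column header shown above (name,class,clade), and the helper name tally_classes is made up for illustration; it is not part of SalmonTE.

```python
import csv
import io
from collections import Counter


def tally_classes(handle):
    """Count how many entries fall into each value of the 'class' column."""
    reader = csv.DictReader(handle)
    return Counter(row["class"] for row in reader)


# In-memory sample mirroring the output quoted above. For a real run, pass
# an open file handle for the generated clades.csv instead.
sample = io.StringIO(
    "name,class,clade\n"
    "rnd-1_family-19,other,other\n"
    "rnd-1_family-4,other,other\n"
    "rnd-1_family-0,other,other\n"
)

counts = tally_classes(sample)
print(counts)

if set(counts) == {"other"}:
    print("Every family is 'other': the annotation table was probably not matched.")
```

If every entry lands in "other", the annotation table (clades_extended.csv here) is a reasonable first place to look.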

fernandes-flavia commented 3 years ago

Hi Lorena,

If you made the .csv file with LibreOffice Calc, it may have automatically added quotation marks around your text (e.g., "DNA/hAT-Ac","DNA/hAT-Ac","DNA transposon","Transposable Element"), and that could be preventing SalmonTE from reading it properly. Try removing the quotation marks and see if it works :)

Best Flavia
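
The suggestion above can also be applied with a few lines of Python instead of editing the file by hand. This is a hedged sketch, not something taken from the SalmonTE code base: the function name strip_quotes is hypothetical, and it assumes that no field needs quoting (that is, no field itself contains a comma or a quote character), which holds for the rows shown in this thread.

```python
import csv
import io


def strip_quotes(src, dst):
    """Copy CSV rows from src to dst, dropping quotes that are not required."""
    reader = csv.reader(src)  # parses "quoted" fields into plain strings
    # QUOTE_MINIMAL only quotes a field when it contains a comma, a quote
    # character, or a newline, so ordinary fields are written bare.
    writer = csv.writer(dst, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for row in reader:
        writer.writerow([field.strip() for field in row])


# In-memory demonstration with two rows from the table in this thread.
quoted = io.StringIO(
    '"DNA/hAT-Ac","DNA/hAT-Ac","DNA transposon","Transposable Element"\n'
    '"LTR/Pao","LTR/Pao","LTR Retrotransposon","Transposable Element"\n'
)
cleaned = io.StringIO()
strip_quotes(quoted, cleaned)
print(cleaned.getvalue())
```

With real files, open both the input and the output with newline="" (as the csv module documentation recommends) and pass the two handles to strip_quotes. Writing to a new file rather than overwriting clades_extended.csv keeps the original available for comparison.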

ncnlll commented 3 years ago

It worked. Thank you so much Flavia :)

Best Lorena