hyunhwan-jeong / SalmonTE

SalmonTE is an ultra-Fast and Scalable Quantification Pipeline of Transposable Element (TE) Abundances
GNU General Public License v3.0

Customise Annotation #15

Closed jwalser closed 6 years ago

jwalser commented 6 years ago

Hey there,

I am uncertain about the annotation step for building a customised reference, and I was wondering if you could help me gain confidence in what I am doing.

We created a reference FASTA file as you described it using the following fasta-headers:

>SINE2 SINE2/tRNA Brachypodium distachyon

In the next step, we ran the indexing using the following command: SalmonTE.py index --input_fasta=BdTE.fasta --ref_name=bd --te_only

The indexing finished without warnings or errors, and a new folder was added to the reference directory. I can even run the next step, quant, and it seems to work. So far so good. I am, however, not sure if (and how) I have to customise the clades_extended.csv file. The TEs in my FASTA headers are different from the information listed in the clades_extended file. Therefore my questions:

I am not sure about the format used in the clades_extended.csv file. In the provided file you are using the following format:

Mariner/Tc1,Mariner/Tc1,DNA transposon,Transposable Element

  1. TE name
  2. TE name (again?)
  3. Class
  4. TE or simple repeat

Do I understand the format correctly and do I have to use the same or could I use the following instead:

L1-5_BDi,LINE,Retrotransposons,Transposable Element

  1. TE name
  2. Order
  3. Class
  4. TE or simple repeat

Thanks for taking the time to consider my questions.

hyunhwan-jeong commented 6 years ago

Hello @jwalser, thanks for your interest and sorry for my poor documentation for the custom index.

  • Do I have to change the clades_extended file?

Yes, you need to have your own clades_extended.csv.

  • Do I have to add my TEs to the existing file?
  • Would it be better to replace the existing file with a customised file?
  • What format should I use?

You can do either, but I recommend making your own file.

Mariner/Tc1,Mariner/Tc1,DNA transposon,Transposable Element

  1. TE name
  2. TE name (again?)
  3. Class
  4. TE or simple repeat

The first one is correct: the names of the TE/repeat sequences in your FASTA file should appear somewhere in the first column. The second column is the clade of the TE, which is used for custom classification of the TE; if you don't need it, you can put the same name as the class of the TE. The third column is the class, so you are correct. The last column is the category of the TE, which is used by the --te_only option of the index mode.
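The four-column layout described above can be sketched in Python. This is only an illustration under assumptions, not part of SalmonTE: the helper names `make_clades_row` and `write_clades` are hypothetical, the class assigned to each example TE is illustrative, and the thread does not say whether the file needs a header row, so none is written here (compare against the clades_extended.csv shipped with the tool).

```python
import csv
import io

def make_clades_row(name, te_class, clade=None, category="Transposable Element"):
    """Return one row in the order: name, clade, class, category (hypothetical helper)."""
    # Per the reply above, the clade column may simply repeat the class
    # when no finer grouping is needed.
    if clade is None:
        clade = te_class
    return [name, clade, te_class, category]

def write_clades(rows, handle):
    """Write rows as plain comma-separated lines (no header row; an assumption)."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerows(rows)

# Illustrative entries only; the class values are assumptions for the example.
rows = [
    make_clades_row("L1-5_BDi", "Retrotransposons", clade="LINE"),
    make_clades_row("SINE2", "Retrotransposons"),
]

buffer = io.StringIO()
write_clades(rows, buffer)
clades_csv_text = buffer.getvalue()
print(clades_csv_text, end="")
```

To produce an actual file, the same `write_clades` call can be given a handle opened with `open("clades_extended.csv", "w", newline="")` instead of the in-memory buffer.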

Do I understand the format correctly and do I have to use the same or could I use the following instead:

L1-5_BDi,LINE,Retrotransposons,Transposable Element

  1. TE name
  2. Order
  3. Class
  4. TE or simple repeat

You don't have to add the additional information to your FASTA file. You only need the name of the TE sequence there, and clades_extended.csv must contain the information about the TE.
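Since every sequence name in the FASTA file has to be listed in clades_extended.csv, a quick consistency check before indexing may help. The sketch below is hypothetical and not part of SalmonTE; it assumes the sequence name is the first whitespace-delimited token of each FASTA header and that the CSV has no header row.

```python
import csv
import io

def fasta_names(fasta_lines):
    """Collect sequence names, assuming the name is the first token after '>'."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.append(tokens[0])
    return names

def missing_from_clades(fasta_lines, clades_lines):
    """Return FASTA names that do not appear in the first CSV column."""
    annotated = {row[0].strip() for row in csv.reader(clades_lines) if row}
    return [name for name in fasta_names(fasta_lines) if name not in annotated]

# In-memory stand-ins for BdTE.fasta and clades_extended.csv (illustrative data).
fasta = io.StringIO(
    ">SINE2 SINE2/tRNA Brachypodium distachyon\nACGT\n"
    ">L1-5_BDi\nTTGA\n"
)
clades = io.StringIO("L1-5_BDi,LINE,Retrotransposons,Transposable Element\n")

missing = missing_from_clades(fasta, clades)
print(missing)
```

With real files, open handles can be passed in place of the `io.StringIO` objects; any name reported as missing needs a row of its own in the CSV.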

Thanks for taking the time to consider my questions.

Please let me know if you have any additional questions.

Hyun-Hwan Jeong

jwalser commented 6 years ago

Dear Hyun-Hwan Jeong,

thank you for the swift help. Everything regarding the extension is clear now.

Another question: Do you have any (detailed) documentation about the output files?

Thanks again for all the help!

hyunhwan-jeong commented 6 years ago

@jwalser, my pleasure! We don't have any documentation for the output files yet, but I will upload some soon.

Thank you,

Hyun-Hwan Jeong