hyunhwan-jeong / SalmonTE

SalmonTE is an ultra-Fast and Scalable Quantification Pipeline of Transposable Element (TE) Abundances
GNU General Public License v3.0

No module named 'pandas' #3

Closed wjyzidane closed 6 years ago

wjyzidane commented 6 years ago

I ran SalmonTE like this:

./SalmonTE.py quant --reference=hs ./example/CTRL_1_R1.fastq

and got the error below:

[Screenshot of the error; per the issue title: No module named 'pandas']

However, the quantification files seem to have been generated here: SalmonTE_output/CTRL_1_R1/quant.sf

but I am not sure whether they are correct. I would appreciate your help. Thanks!

hyunhwan-jeong commented 6 years ago

Hello @wjyzidane,

I believe you can solve the problem by running the command below:

pip3 install pandas --user
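To confirm that the install took effect, a quick check along these lines may help. It is only a minimal sketch: `module_available` is a hypothetical helper name, and the check is only meaningful when run with the same `python3` interpreter that runs SalmonTE.py.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the current interpreter can locate the top-level module `name`."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # pandas is the module named in the error message of this issue.
    for name in ("pandas",):
        status = "found" if module_available(name) else "missing"
        print(f"{name}: {status}")
```

If this reports `missing` after the install, `pip3` and `python3` are probably pointing at different Python installations.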

Please let me know if it works.

Thank you,

Hyun-Hwan Jeong

wjyzidane commented 6 years ago

Hi Hyun-Hwan,

It works! Thanks a lot!

I am reading the result files from SalmonTE, and I wonder whether there is a manual or a detailed explanation of each output file, as well as of input options like "--exprtype=exprtype".

I am a little confused about why NumReads in quant.sf is not an integer. I think the TPM in quant.sf is the same number as in EXPR.csv, and that is what we need to quantify each repeat, right?

Thanks!

wjyzidane commented 6 years ago

Hi Hyun-Hwan,

I found that EXPR.csv includes 688 repeat categories for the human genome, but there are actually 1396 repeat categories in the hg19 RepeatMasker annotation. I wonder why there is such a big difference. Thanks!

Jingyi

hyunhwan-jeong commented 6 years ago

Hello Jingyi,

  1. The --exprtype option accepts two values: TPM (if you pass TPM or do not set the parameter on the command line) or NumReads counts (if you pass count). NumReads is not an integer because it is an estimate (an approximation), but it is fine to use the number after rounding. If you only want to see the abundance of repeat elements, I recommend the TPM option; if you want to do differential expression analysis with DESeq, please use the count option. If this is not clear and you would like a better answer, please tell me about your experimental setup.

  2. I collected the TEs from Repbase, not RepeatMasker, and removed redundant elements in a cleaning phase, which leaves 687 elements. The paragraph below, quoted from my SalmonTE paper, explains the process:

To build the index library for the quasi-mapping, SalmonTE takes the FASTA file of cDNA sequences from TE databases such as Repbase (version 22.06)[23]. In the current version, the index files for Homo sapiens and Drosophila melanogaster are available. We reasoned that it is hard to estimate TEs which replicate without an RNA intermediate from RNA-seq sample. Therefore, we excluded the following elements: simple repeats and multi-copy genes, and DNA transposable. After collecting the cDNA sequences, we manually curated clades of each TE based on the repeat class annotation from Repbase. As a result, the generated TE library index database contains 687 TEs for Homo sapiens and 163 TEs for Drosophila melanogaster.
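For the rounding mentioned in point 1, a minimal sketch might look like the following. It assumes the usual tab-separated Salmon quant.sf layout with `Name`, `Length`, `EffectiveLength`, `TPM` and `NumReads` columns. The helper name `rounded_counts` and the two example rows are made up for illustration and are not SalmonTE output.

```python
import csv
import io


def rounded_counts(quant_sf_text: str) -> dict:
    """Map each TE name to its NumReads estimate rounded to the nearest integer."""
    reader = csv.DictReader(io.StringIO(quant_sf_text), delimiter="\t")
    return {row["Name"]: int(round(float(row["NumReads"]))) for row in reader}


# Illustrative rows only; the names and numbers are invented for this sketch.
EXAMPLE = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "TE_A\t6000\t5800.0\t120.5\t12.7\n"
    "TE_B\t300\t150.0\t40.0\t3.2\n"
)

if __name__ == "__main__":
    # Fractional estimates become whole numbers: 12.7 -> 13, 3.2 -> 3.
    print(rounded_counts(EXAMPLE))
```

To apply it to a real run, the text could be read from a file such as SalmonTE_output/CTRL_1_R1/quant.sf and passed to the same function.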

Thank you,

Hwan

wjyzidane commented 6 years ago

It makes sense now. Thank you so much!