Preparing TPM files with gene symbol

jasonsaunderswilliams commented 5 years ago

Hi Gianni,

I am very interested in using your tool. Could you let me know which program(s) you used to prepare your TPM tab file with gene symbols? I have generated count tables before, but they usually contain counts keyed by e.g. Ensembl IDs, not TPM.

I am working with paired-end fastq files from whole PBMCs.

I already have a gene count table generated through quantMode in STAR...

Best wishes,

Jason

giannimonaco commented 5 years ago

Hi Jason,

thank you for your interest in using the tool!

The TPM values were obtained using kallisto and tximport. Kallisto gives the TPM values for each transcript, and tximport summarises these expression values into gene-level ones. For the annotation of transcripts and genes, I used the data from GENCODE (https://www.gencodegenes.org/human/). Kallisto takes the transcriptome as input, and it will give you both Ensembl IDs and gene symbols.
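As a rough illustration of the summarisation step, the sketch below sums transcript-level TPM values per gene, which is the basic idea behind a gene-level abundance table. It is not tximport itself (tximport also handles counts and transcript lengths), and the function name, the `tx2gene` mapping and the example values are all hypothetical.

```python
from collections import defaultdict


def summarise_to_genes(transcript_tpm, tx2gene):
    """Sum transcript-level TPM values into gene-level TPM values.

    transcript_tpm: dict mapping transcript ID -> TPM
    tx2gene:        dict mapping transcript ID -> gene symbol
    Transcripts that are missing from tx2gene are skipped.
    """
    gene_tpm = defaultdict(float)
    for transcript_id, tpm in transcript_tpm.items():
        gene = tx2gene.get(transcript_id)
        if gene is not None:
            gene_tpm[gene] += tpm
    return dict(gene_tpm)


# Hypothetical example: two transcripts of GENE_A and one of GENE_B.
transcript_tpm = {"tx1": 3.0, "tx2": 1.5, "tx3": 10.0}
tx2gene = {"tx1": "GENE_A", "tx2": "GENE_A", "tx3": "GENE_B"}

gene_level = summarise_to_genes(transcript_tpm, tx2gene)
print(gene_level)
```

Because TPM is already normalised across the whole sample, adding up the transcripts of a gene keeps the total at the same scale.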

If you want to use the count table that you obtained with STAR, you can transform it to TPM values with a formula like the one you find here: https://github.com/dariober/bioinformatics-cafe/blob/master/TPM.R You can also convert Ensembl IDs into gene symbols with the annotation from GENCODE or Biomart.
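For orientation, here is a minimal sketch of the usual TPM definition: divide each count by the feature length, then rescale so that the values sum to one million. It is not a port of the linked script, and it assumes you can supply a length for every gene (for example from your annotation); the function and gene names are hypothetical.

```python
def counts_to_tpm(counts, lengths):
    """Convert read counts to TPM.

    counts:  dict mapping gene -> read count
    lengths: dict mapping gene -> length in bases (effective length if you have it)
    """
    # Reads per base for every gene.
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        # No reads at all: every TPM is zero.
        return {gene: 0.0 for gene in counts}
    # Rescale so that all values add up to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Hypothetical example: GENE_B has three times the reads but is three times longer.
counts = {"GENE_A": 100, "GENE_B": 300}
lengths = {"GENE_A": 1000, "GENE_B": 3000}

tpm = counts_to_tpm(counts, lengths)
print(tpm)
```

In this example both genes end up with the same TPM, because the longer gene is expected to collect proportionally more reads.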

Let me know if you have any issues.

Best wishes, Gianni

jasonsaunderswilliams commented 5 years ago

Hi Gianni,

Thanks for your response.

I am running kallisto now (taking some time).

I first built the index and am now running the quantification. Just to check: will the abundance.tsv output (I included the --plaintext option) include both TPM and the corresponding gene symbols?

Best wishes,

Jason

giannimonaco commented 5 years ago

Hi Jason,

Yes, the abundance.tsv file should contain Ensembl IDs, gene symbols and other fields, delimited by the symbol "|". You can split them and retain only the gene symbols.
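A small sketch of that splitting step is shown below. It assumes the target IDs follow the layout of GENCODE transcript FASTA headers, where the gene symbol is the sixth "|"-separated field; please check this against your own file. The function name and the example ID are only illustrative.

```python
def gene_symbol_from_target_id(target_id, symbol_field=5):
    """Return the gene symbol from a "|"-delimited target ID.

    symbol_field is the zero-based position of the gene symbol; 5 is an
    assumption based on the GENCODE transcript FASTA header layout.
    Returns None when the ID has no such field (e.g. a bare ENST ID).
    """
    fields = target_id.split("|")
    if len(fields) <= symbol_field:
        return None
    return fields[symbol_field] or None


# Illustrative GENCODE-style ID and a bare transcript ID.
full_id = (
    "ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|"
    "OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|"
)
bare_id = "ENST00000632684.1"

print(gene_symbol_from_target_id(full_id))  # DDX11L1
print(gene_symbol_from_target_id(bare_id))  # None
```

If the function returns None for every row, the index was probably built from a FASTA file whose headers contain only transcript IDs.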

Best wishes,

Gianni

jasonsaunderswilliams commented 5 years ago

Hi Gianni,

Thanks for your response again. Sorry, I initially did not use a GTF file and chromosome length file, so I ended up with a list of transcript IDs with TPMs. I am now running with these two files. Will I now get a table with Ensembl IDs and gene symbols as you described above?

i.e. this command from the kallisto manual: kallisto quant -i transcripts.kidx -b 30 -o kallisto_out --genomebam --gtf transcripts.gtf.gz --chromosomes chrom.txt reads_1.fastq.gz reads_2.fastq.gz

It's not clear from the manual that this will generate an abundance.tsv with these identifiers.

Thanks again for your help.

Best wishes,

Jason

giannimonaco commented 5 years ago

Actually, I did not use the GTF file in the kallisto command, and I still obtained all the gene identifiers in abundance.tsv. What did you obtain?

jasonsaunderswilliams commented 5 years ago

So the run with the gtf file produced the same output:

target_id          length  eff_length  est_counts  tpm
ENST00000632684.1  12      13          0           0
ENST00000434970.2  9       10          0           0
ENST00000448914.1  13      14          0           0
ENST00000415118.1  8       9           0           0
ENST00000631435.1  12      13          0           0
ENST00000390567.1  20      21          0           0
ENST00000439842.1  11      12          0           0
ENST00000454908.1  17      18          0           0
ENST00000390583.1  31      6           0           0
ENST00000390572.1  28      3           0           0
ENST00000390571.1  31      6           0           0
ENST00000454691.1  18      19          0           0
ENST00000390588.1  20      21          0           0
ENST00000390581.1  23      24          0           0
ENST00000390574.1  21      22          0           0
ENST00000450276.1  17      18          0           0
ENST00000431870.1  16      17          0           0
ENST00000414852.1  16      17          0           0
ENST00000390590.1  31      6           0           0
ENST00000390584.1  31      6           0.5         3.15288
ENST00000452198.1  18      19          0           0
ENST00000634154.1  16      17          0           0
ENST00000631895.1  23      24          0           0
ENST00000633030.1  19      20          0           0
ENST00000632524.1  11      12          0           0
ENST00000633009.1  20      21          0           0
ENST00000634070.1  18      19          0           0
ENST00000390591.1  31      6           0           0
ENST00000431440.2  16      17          0           0
ENST00000390580.1  21      22          0           0
ENST00000451044.1  17      18          0           0
ENST00000390569.1  20      21          0           0
ENST00000390578.1  31      6           0.5         3.15288
ENST00000430425.1  17      18          0           0
ENST00000390585.1  31      6           0           0
ENST00000437320.1  19      20          0           0
ENST00000390575.1  20      21          0           0
ENST00000390577.1  37      7.5         1           5.04461

I can't see any options in the manual for gene id or symbol.

Do you have the arguments you used to generate your abundance.tsv file?

All the best,

Jason

giannimonaco commented 5 years ago

Hi Jason,

I understand: you have only the Ensembl IDs for your transcripts. In that case, you can try retrieving the gene symbol for each transcript with something like biomaRt ( https://bioconductor.org/packages/release/bioc/html/biomaRt.html).

The command I used was this one:

kallisto quant -i Transcriptome_gencode26.idx -o Sample1_kallisto -b 100 -t12 Sample1_1.fastq.gz Sample1_2.fastq.gz

Best wishes,

Gianni

jasonsaunderswilliams commented 5 years ago

Hi Gianni,

Apologies, could you let me know how you generated your index in kallisto? The problem is that I would need to compile the transcripts in that table into genes (I don't think you can do this using Biomart). Hopefully I can avoid doing this, as I don't have much coding/R experience, as you can tell.

Best,

Jason

giannimonaco commented 5 years ago

Hi Jason,

No problems. I generated the index in this way: kallisto index -i Transcriptome_gencode26.idx gencode.v26.transcripts.fa

Hope you can solve it in this way. Let me know if it works!

Best, Gianni

jasonsaunderswilliams commented 5 years ago

Hi Gianni,

So we managed to use tximport in R to generate a TPM table with gene symbols.

From what I could see in the PBMC.txt example file you have on here, this is the expected input format.

However, I get an error when I try running it through ABIS.

Here's what the tab delimited file looks like:

A1BG     443.8711582
A1CF     9023.894
A2M      4646.894
A2ML1    2759.33676
A2MP1    1003.894
A3GALT2  825.894
A4GALT   1737.894
A4GNT    1573.894
AAAS     1108.690818
AACS     2124.734521
AACSP1   2624.894
AADAC    983.673
AADACL2  4650.894
AADACL3  3549.894
AADACL4  1377.894
AADACP1  545.7113333
AADAT    609.385
AAGAB    2312.413629
AAK1     4940.470245
AAMDC    384.5673495
AAMP     1332.291864
AANAT    1715.894
AAR2     2228.70423
AARD     2101.894

Any help would be appreciated.

Best wishes,

Jason

giannimonaco commented 5 years ago

Hi Jason,

Can I have more details about the error? Did you prepare the file like the TPMPBMC.txt file in the data folder?

Best, Gianni

jasonsaunderswilliams commented 5 years ago

Hi Gianni, here's the error:

Error: An error has occurred. Check your logs or contact the app author for clarification.

I checked the TPMPBMC.txt file, although I am only able to view it in my browser. It looks like it's just gene symbols with the associated TPM?

To summarise, I quantified abundance in salmon, generating a quant.sf file.

This quant.sf file had Ensembl transcript IDs. So we used tximport to convert this to abundances at the gene level.

The output of tximport is this file:

          abundance  counts            length            countsFromAbundance
A1BG      14.801154  174.389368466109  443.871158242256  lengthScaledTPM
A1CF      0.070976   17.0009428122482  9023.894          lengthScaledTPM
A2M       0.081072   10.0000265727209  4646.894          lengthScaledTPM
A2ML1     0.21693    15.8888265514784  2759.33676033744  lengthScaledTPM
A2MP1     0.112581   2.99999452203194  1003.894          lengthScaledTPM
A3GALT2   0.09123    1.99999765781185  825.894           lengthScaledTPM
A4GALT    0.043355   1.99999990742171  1737.894          lengthScaledTPM
A4GNT     0          0                 1573.894          lengthScaledTPM
AAAS      43.392327  1276.99998614927  1108.69081785711  lengthScaledTPM
AACS      9.327308   526.051778446881  2124.73452107618  lengthScaledTPM
AACSP1    0.071761   4.9999740847518   2624.894          lengthScaledTPM
AADAC     0          0                 983.673           lengthScaledTPM
AADACL2   0.0162     1.99994916657532  4650.894          lengthScaledTPM
AADACL3   0          0                 3549.894          lengthScaledTPM
AADACL4   0          0                 1377.894          lengthScaledTPM

Fourth column is TPM.

I then just removed the other columns in Excel and saved the result as a .txt file that has only the gene symbols and TPM.

Please let me know if you need any more info.

Thanks again,

Jason

giannimonaco commented 5 years ago

Hi Jason,

Yes, it is just gene symbols and TPM values. Each column should contain the TPM values of a different sample. I believe the problem is that you are modifying the file with Excel. Excel automatically converts certain gene symbols to dates, e.g. MARCH1 to 1-Mar ( https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1044-7 ).
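A quick way to check whether a file has already been affected is to look for symbols that read like dates. The sketch below is only a heuristic and assumes the "1-Mar" style shown above; other date formats that Excel may produce would need further patterns. The names used are hypothetical.

```python
import re

# Matches values such as "1-Mar" or "2-Sep" (day number, dash, month abbreviation).
DATE_LIKE = re.compile(
    r"^\d{1,2}-(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$",
    re.IGNORECASE,
)


def find_date_like_symbols(symbols):
    """Return the entries that look like dates rather than gene symbols."""
    return [s for s in symbols if DATE_LIKE.match(s.strip())]


# Hypothetical first column of a TPM table that went through Excel.
symbols = ["A1BG", "1-Mar", "AAAS", "2-Sep", "MARCH1"]

suspicious = find_date_like_symbols(symbols)
print(suspicious)  # ['1-Mar', '2-Sep']
```

An empty result does not prove that the file is clean, but any hit is a strong sign that the symbols were converted.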

You should modify the file with something else, for example with R directly.
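If R is not at hand, the same trimming can be sketched in Python. The example assumes a table shaped like the tximport output posted above, i.e. a header with one name fewer than the data rows because the gene symbols are row names. Going by that header, the TPM values appear to be in the `abundance` column while the fourth field is `length`, so the sketch selects the column by name rather than by position. The function name and the `GeneSymbol` and `Sample1` header labels are placeholders, not names required by ABIS.

```python
def select_column(lines, column="abundance", sample_name="Sample1"):
    """Keep only the gene symbol and one named column, tab-delimited.

    lines: whitespace-separated rows; the first row is a header that lacks
           a name for the leading gene-symbol field (assumption, see above).
    """
    header = lines[0].split()
    # +1 because every data row starts with the gene symbol.
    index = header.index(column) + 1
    output = ["GeneSymbol\t" + sample_name]
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        output.append(fields[0] + "\t" + fields[index])
    return output


# Three lines taken from the table posted earlier in the thread.
table = [
    "abundance counts length countsFromAbundance",
    "A1BG 14.801154 174.389368466109 443.871158242256 lengthScaledTPM",
    "A4GNT 0 0 1573.894 lengthScaledTPM",
]

trimmed = select_column(table)
print("\n".join(trimmed))
```

Writing `trimmed` to a .txt file with one entry per line gives a tab-delimited table with a header row, without Excel touching the gene symbols.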

Best,

Gianni

jasonsaunderswilliams commented 5 years ago

Hello again Gianni,

Thanks for the tip. So after removing the other columns using R I still have the same error message.

I attach the file I am inputting so you can see it.

I have only one sample at the moment that I am trying this out with as you can see. eg3.txt

Best wishes,

Jason

jasonsaunderswilliams commented 5 years ago

Hi Gianni,

Here is a tab delimited version of that file as well, also same error message.

By the way, our TPM values are much higher than those in the example file and are rounded to various decimal places. Hopefully neither of these is the problem.

Best,

Jason eg3.txt https://github.com/giannimonaco/ABIS/files/3615876/eg3.txt

giannimonaco commented 5 years ago

Hi Jason,

Previously the tool was not able to handle a file with only one sample column. Now it can. Try it and let me know!

Note that your file has two spaces instead of a tab for the second gene (A1CF). Moreover, the first row should have headers for your columns.
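The two points above (tab delimiters and a header row) can be checked before uploading. The sketch below is a hypothetical helper, not part of ABIS: it assumes the first column holds gene symbols and every other column holds numeric TPM values.

```python
def _is_number(text):
    """True if text can be parsed as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def check_tpm_table(lines):
    """Return a list of problems found in a tab-delimited TPM table."""
    if not lines:
        return ["file is empty"]
    problems = []
    header = lines[0].rstrip("\n").split("\t")
    n_cols = len(header)
    if n_cols < 2:
        problems.append("line 1: fewer than two tab-separated columns")
    elif any(_is_number(name) for name in header[1:]):
        problems.append("line 1: looks like data, not a header row")
    for number, line in enumerate(lines[1:], start=2):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != n_cols:
            problems.append(
                f"line {number}: expected {n_cols} tab-separated fields, "
                f"found {len(fields)}"
            )
            continue
        if not all(_is_number(value) for value in fields[1:]):
            problems.append(f"line {number}: non-numeric TPM value")
    return problems


# A file with no header and two spaces instead of a tab on the second line.
bad = ["A1BG\t443.8711582", "A1CF  9023.894", "A2M\t4646.894"]
good = ["GeneSymbol\tSample1", "A1BG\t14.801154", "A1CF\t0.070976"]

for problem in check_tpm_table(bad):
    print(problem)
```

For the `bad` example this reports the missing header and the space-delimited line, while `good` passes without any message.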

Best,

Gianni

jasonsaunderswilliams commented 4 years ago

Hi Gianni,

All working now. You were right about Excel. You can get around this by formatting the cells as "text" rather than "general", although it's still better to edit in R.

Thanks for your help!

The GeneViewer is great too. However, I wonder whether it is possible to get TPM tables that include all genes for each cell?

Best wishes,

Jason

giannimonaco commented 4 years ago

Hi Jason,

You can find the full table with TPM values in the GEO database ( https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107011). Scroll down and you will find it as a supplementary file.

Best, Gianni
