thelovelab / fishpond

Differential expression and allelic analysis, nonparametric statistics
https://thelovelab.github.io/fishpond

gene_ids after summarising counts with loadFry #33

Closed ChristelKrueger closed 1 year ago

ChristelKrueger commented 1 year ago

Thank you for making Fishpond! I have been using the loadFry function to aggregate USA counts produced by Alevin Fry (adding up U+S+A). I would have expected that the summarised counts table would have a third of the gene_ids but actually there are fewer. The only filtering I found in the documentation was nonzero but that defaults to FALSE. Looking up some of the ENSGs that are missing from the collated output, it seems that they are pseudogenes. Is this some additional filtering that loadFry does?
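The expectation of "a third of the gene_ids" can be illustrated with a small sketch in plain Python. It is not fishpond's or loadFry's actual code. It assumes one cell's USA counts are laid out as three equal-length blocks, one per splicing status, and the function name `collapse_usa` and the toy numbers are hypothetical.

```python
def collapse_usa(row, n_genes):
    """Sum the three per-gene blocks of one cell's USA counts.

    Assumption: `row` holds 3 * n_genes values arranged as three
    consecutive blocks of n_genes entries each (one block per splicing
    status). The block order does not matter when all three are summed.
    """
    if len(row) != 3 * n_genes:
        raise ValueError("expected exactly 3 * n_genes features")
    return [
        row[g] + row[n_genes + g] + row[2 * n_genes + g]
        for g in range(n_genes)
    ]


# Toy cell with 3 genes, so 9 USA features collapse to 3 gene-level counts.
toy_row = [1, 0, 2, 0, 1, 0, 3, 0, 0]
summed = collapse_usa(toy_row, n_genes=3)
```

Under this assumption, a table with 3 x G features always collapses to exactly G genes, so any shortfall points to the input rather than to the summation.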

mikelove commented 1 year ago

Hi, let me tag @DongzeHE, who wrote loadFry().

ChristelKrueger commented 1 year ago

Thank you! :-)

DongzeHE commented 1 year ago

Hello @ChristelKrueger,

Sorry, I missed this message!

Alevin-fry USA mode works as follows:

  1. We first generate a splici reference using a genome FASTA file and a gene GTF file. At the same time, we generate a transcript-to-gene name mapping file, whose name usually ends with t2g_3col.tsv. All genes that show up in the second column of this file should exist in the final count matrix.
  2. We build a salmon/piscem reference index.
  3. We map the reads against this reference index.
  4. We process the mapping records and quantify the UMI count for each gene in the transcript-to-gene name mapping file.

So, if a gene is not in the final count matrix, I would suspect that the gene is not in the t2g_3col.tsv file either. Therefore, to answer your question, could you please tell me where you got the gene_ids? Could you also check whether the missing genes are in the t2g_3col.tsv file?
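A minimal sketch of that check in Python, assuming the mapping file is tab-separated with the gene name in the second column, as described above. The helper name `genes_missing_from_t2g`, the toy rows, and the gene IDs are all hypothetical. For a real file, pass an open file handle instead of the in-memory sample.

```python
import csv
import io


def genes_missing_from_t2g(t2g_handle, gene_ids):
    """Return the gene IDs that never appear in column 2 of a t2g file.

    Assumption: the file is tab-separated and its second column holds
    the gene name, as stated for t2g_3col.tsv in this thread.
    """
    reader = csv.reader(t2g_handle, delimiter="\t")
    genes_in_t2g = {row[1] for row in reader if len(row) >= 2}
    return sorted(set(gene_ids) - genes_in_t2g)


# Toy stand-in for a t2g_3col.tsv file; the IDs are made up.
sample_t2g = io.StringIO(
    "TX_TOY_1\tGENE_TOY_1\tS\n"
    "TX_TOY_1-I\tGENE_TOY_1\tU\n"
    "TX_TOY_2\tGENE_TOY_2\tS\n"
)

# IDs that are absent from the collated output and need checking.
suspects = ["GENE_TOY_2", "GENE_TOY_3"]
missing = genes_missing_from_t2g(sample_t2g, suspects)
```

Any ID returned here was never part of the reference mapping, so it could not appear in the count matrix. An ID that is present in the mapping but absent from the output points to a problem elsewhere, such as how the files were copied or loaded.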

Thanks, Dongze

ChristelKrueger commented 1 year ago

Thank you @DongzeHE for these explanations - it helped me to understand what had gone wrong. I had been experimenting with the counts outside the alevin folder structure but had made a silly mistake while copying (which took me an embarrassingly long time to realise ...). Sorry about this - entirely my bad! Done correctly, the numbers do add up as expected.

DongzeHE commented 1 year ago

It's totally fine @ChristelKrueger. Thanks for trying out alevin-fry!

If you process single-cell data regularly, you can also try out our new wrapper program, simpleaf! ;P