pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License
141 stars 24 forks source link

de novo `kb ref` and `kb ref -d` generate different index and t2g files #180

Closed sbwiecko closed 1 year ago

sbwiecko commented 1 year ago

After downloading a prebuild index as follow:

kb ref \
-d human \ #download a prebuilt index
-i Homo_sapiens.GRCh38.cdna.all.index \
-g t2g.txt

t2g.txt is 9.6Mb and the index file 3.0Gb, while building it de novo:

kb ref \
-i Homo_sapiens.GRCh38.cdna.all.index \
-g t2g.txt \
-f1 transcriptome.fa \
Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz \
Homo_sapiens.GRCh38.108.gtf.gz

gives a 19Mb t2g.txt and 3.2Gb index files.

It looks like the de novo method generates a t2g.txt file with empty rows or transcprits with no correspondence to any gene:

ENST00000447513.7   ENSG00000157911.11  PEX10   PEX10-202   1   2403974 2412564 -
ENST00000417061.2   ENSG00000269896.2                   1   2351644 2351857 -
ENST00000602865.1   ENSG00000269896.2                   1   2350414 2352820 -

And this lead to some issue in my downstream analysis while creating a Seurat object from a Read10X expression matrix, because there are empty rownames in the matrix.

Is there anything I did wrong ? How to get ride of the transcripts with no corresponding gene ?

Yenaled commented 1 year ago
  1. What version of kb are you using? (kb --version)
  2. Can you provide the URLs for Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz and Homo_sapiens.GRCh38.108.gtf.gz?
sbwiecko commented 1 year ago
  1. kb_python 0.27.3
  2. I used gget ref -w dna,gtf homo_sapiens to get the links to these 2 files:
Yenaled commented 1 year ago

Thanks! This is because the gene_name field is empty for those transcripts in the latest ENSEMBL GTF annotation (though I've found the gene name AL513477.1 in some other GTFs). You could either simply use the ENSEMBL gene IDs (since none of those will be empty) in lieu of gene names or you can move the ENSEMBL gene IDs into the third column (the gene names column) of the t2g.txt file for any records where there is a missing gene name.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days