Generate transcriptome reference file for STAR alignment

icgc-argo / workflow-roadmap

Roadmap and management for genomic data processing

GNU Affero General Public License v3.0

1 stars 0 forks source link

Generate transcriptome reference file for STAR alignment #438

Closed guanqiaofeng closed 5 months ago

guanqiaofeng commented 5 months ago

In STAR, it outputs transcriptome based alignment bam files. To sort the bam files and convert the bam files to cram files, it requires transcript.fasta and transcript.fasta.fai file. So we need to

[x] generate transcript.fasta based on GRCh38_Verily_v1.genome.fa and gtf files
[x] generate transcript.fasta.fai using samtools

These two files need to be uploaded under https://github.com/icgc-argo-workflows/argo-reference-files

guanqiaofeng commented 5 months ago

Inputs: reference genome fasta: GRCh38_Verily_v1.genome.fa reference genome fasta.fai: GRCh38_Verily_v1.genome.fa.fai reference gtf: gencode.v40.chr_patch_hapl_scaff.annotation.gtf

Outputs: GRCh38_Verily_v1.gencode_v40.transcriptome.fa GRCh38_Verily_v1.gencode_v40.transcriptome.fa.fai

Code: python3 generate_transcriptome_fa_fai.py

Outputs and Code are in Cumulus VM (ubuntu@10.30.134.105:~/transcript_fa) - please review them @lindaxiang @edsu7

guanqiaofeng commented 5 months ago

After discussing with Linda, we've clarified our approach: we should focus solely on matching the chromosome or contig names exactly. Previous Python code included unnecessary additional name matching steps. Our rationale is that the transcriptome-based alignment file generated by STAR utilizes the provided FASTA and GTF files, which we assume are configured to require exact matches for chromosome or contig names.

Besides the python approach, gffread can extract transcript fasta files based on genome fasta and gtf file.

Docker image for gffread (v0.12.7): https://hub.docker.com/r/dceoy/gffread

Code:

gffread -w transcripts.fa -g GRCh38_Verily_v1.genome.fa gencode.v40.chr_patch_hapl_scaff.annotation.gtf

$ grep ">" GRCh38_Verily_v1.gencode_v40.transcriptome.fa | wc -l
  246683

$ grep ">" transcripts.fa| wc -l
  246624

guanqiaofeng commented 5 months ago

updated python code to only consider exact match of chr/contig name. Output result is the same as gffread.

edsu7 commented 5 months ago

~~Update:~~ ~~- result files are done~~ ~~To do:~~ ~~- add as subworkflow in RNA-Seq~~ ~~- need to upload to reference bucket~~

Moving item to tickets: https://github.com/icgc-argo/workflow-roadmap/issues/446 https://github.com/icgc-argo/workflow-roadmap/issues/445