Closed by guanqiaofeng 5 months ago
Inputs:
- reference genome fasta: GRCh38_Verily_v1.genome.fa
- reference genome fasta.fai: GRCh38_Verily_v1.genome.fa.fai
- reference gtf: gencode.v40.chr_patch_hapl_scaff.annotation.gtf

Outputs:
- GRCh38_Verily_v1.gencode_v40.transcriptome.fa
- GRCh38_Verily_v1.gencode_v40.transcriptome.fa.fai
Code: python3 generate_transcriptome_fa_fai.py
Outputs and Code are in Cumulus VM (ubuntu@10.30.134.105:~/transcript_fa) - please review them @lindaxiang @edsu7
After discussing with Linda, we've clarified our approach: we should only match chromosome or contig names exactly. The previous Python code included unnecessary additional name-matching steps. Our rationale is that STAR generates the transcriptome-based alignment file from the provided FASTA and GTF files, which we assume must have exactly matching chromosome or contig names.
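To make the exact-match rule concrete, here is a minimal sketch (not the code in generate_transcriptome_fa_fai.py) that lists GTF contig names with no identical entry in the genome .fai. The function names and the toy records are hypothetical; the only assumptions about the formats are that the contig name is the first tab-separated column of both the .fai and the GTF.

```python
import io

def fai_contigs(fai_handle):
    """Sequence names from a .fai index (first tab-separated column)."""
    return {line.split("\t", 1)[0].strip() for line in fai_handle if line.strip()}

def gtf_contigs(gtf_handle):
    """Seqname values (column 1) from a GTF, skipping '#' comment lines."""
    names = set()
    for line in gtf_handle:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0].strip())
    return names

def unmatched_contigs(fai_handle, gtf_handle):
    """GTF contig names without an exact, case-sensitive match in the index."""
    return sorted(gtf_contigs(gtf_handle) - fai_contigs(fai_handle))

# Toy data (hypothetical): 'MT' in the GTF does not exactly match 'chrM' in the index.
toy_fai = io.StringIO("chr1\t1000\t6\t60\t61\nchrM\t16569\t1100\t60\t61\n")
toy_gtf = io.StringIO(
    "##description: toy annotation\n"
    'chr1\tHAVANA\texon\t1\t10\t.\t+\t.\tgene_id "g1";\n'
    'MT\tENSEMBL\texon\t5\t20\t.\t+\t.\tgene_id "g2";\n'
)
print(unmatched_contigs(toy_fai, toy_gtf))  # ['MT']
```

No renaming or fuzzy matching (for example `MT` to `chrM`) is attempted; transcripts on unmatched contigs would simply be reported rather than rescued.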
Besides the Python approach, gffread can extract transcript FASTA sequences from a genome FASTA and a GTF file.
Docker image for gffread (v0.12.7): https://hub.docker.com/r/dceoy/gffread
Code:
gffread -w transcripts.fa -g GRCh38_Verily_v1.genome.fa gencode.v40.chr_patch_hapl_scaff.annotation.gtf
$ grep ">" GRCh38_Verily_v1.gencode_v40.transcriptome.fa | wc -l
246683
$ grep ">" transcripts.fa| wc -l
246624
Updated the Python code to only consider exact matches of chr/contig names. The output is now the same as gffread's.
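A header count from `grep ">" | wc -l` only shows that the totals agree. As a hedged sketch, the two outputs could also be compared by transcript ID. The helper names and toy records below are hypothetical, and it assumes the ID is the first whitespace-delimited token after `>` in both files, which may not hold for every header format.

```python
import io

def fasta_ids(handle):
    """Collect record IDs: the text after '>' up to the first whitespace."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids

def compare_ids(handle_a, handle_b):
    """Return (IDs only in A, IDs only in B), each sorted."""
    ids_a, ids_b = fasta_ids(handle_a), fasta_ids(handle_b)
    return sorted(ids_a - ids_b), sorted(ids_b - ids_a)

# Toy data (hypothetical IDs): file A has one transcript that file B lacks.
toy_a = io.StringIO(">tx1 gene=A\nACGT\n>tx2\nGGCC\n>tx3\nTTAA\n")
toy_b = io.StringIO(">tx1\nACGT\n>tx2 extra\nGGCC\n")
only_a, only_b = compare_ids(toy_a, toy_b)
print(only_a, only_b)  # ['tx3'] []
```

Because IDs are held in sets, duplicate headers within one file are collapsed, so this complements rather than replaces the line count.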
Update:
- result files are done
To do:
- add as subworkflow in RNA-Seq
- need to upload to reference bucket
Moving item to tickets: https://github.com/icgc-argo/workflow-roadmap/issues/446 https://github.com/icgc-argo/workflow-roadmap/issues/445
STAR outputs transcriptome-based alignment BAM files. Sorting these BAM files and converting them to CRAM requires the transcript.fasta and transcript.fasta.fai files. So we need to:
- [x] generate transcript.fasta based on GRCh38_Verily_v1.genome.fa and gtf files
- [x] generate transcript.fasta.fai using samtools

These two files need to be uploaded under https://github.com/icgc-argo-workflows/argo-reference-files
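For reference, the .fai index has five tab-separated columns per record: name, sequence length, byte offset of the first base, bases per line, and bytes per line. The sketch below only illustrates those columns on toy data, under the assumption that every sequence line within a record has the same width except the last. The real index should still be produced with `samtools faidx transcript.fasta`. The function name and records are hypothetical.

```python
import io

def fai_records(handle):
    """Compute (name, length, offset, linebases, linewidth) per FASTA record.

    `handle` must be opened in binary mode so offsets are byte positions.
    """
    records = []
    name = None
    position = 0  # running byte position in the file
    for raw in handle:
        if raw.startswith(b">"):
            if name is not None:
                records.append((name, length, seq_offset, linebases, linewidth))
            name = raw[1:].split()[0].decode()
            length, linebases, linewidth = 0, 0, 0
            seq_offset = position + len(raw)  # first base follows the header line
        elif name is not None:
            bases = len(raw.rstrip(b"\r\n"))
            if linebases == 0 and bases:
                # The first sequence line defines the line geometry.
                linebases, linewidth = bases, len(raw)
            length += bases
        position += len(raw)
    if name is not None:
        records.append((name, length, seq_offset, linebases, linewidth))
    return records

# Toy transcript FASTA (hypothetical records).
toy = io.BytesIO(b">tx1 gene=A\nACGTACGT\nACGT\n>tx2\nGGCC\n")
for rec in fai_records(toy):
    print("\t".join(str(field) for field in rec))
# tx1	12	12	8	9
# tx2	4	31	4	5
```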