rikenbit / ramdaq

MIT License
12 stars 2 forks source link

Documentation for annotation files #94

Open yfukai opened 8 months ago

yfukai commented 8 months ago

Dear maintainers, It is great that the annotation files for the mt, rRNA and histone genes are well documented in https://github.com/rikenbit/ramdaq/blob/master/docs/local_annotation.md . However, I could not find documentation for other files (.ERCC. etc and the *.ht2 files). Considering the traceability and reproducibility of the analysis, it would be great if it is documented how those files were generated from the original GENCODE entries.

yuifu commented 6 months ago

@yfukai We apologize for the delayed response.

We would like to consider where to document the methods for creating annotations on the GitHub repository.

For your reference in the meantime, here is an example of how it can be done:

Logs for making annotations for GENCODE v44 (human) and GENCODE vM33 (mouse)

download gtf, add ERCC information, make mt.gtf (human)

https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.primary_assembly.annotation.gtf.gz .
gunzip gencode.v44.primary_assembly.annotation.gtf.gz

cat gencode.v44.primary_assembly.annotation.gtf ERCC92.gtf > gencode.v44.primary_assembly.annotation.ERCC.gtf
less gencode.v44.primary_assembly.annotation.gtf | grep chrM > gencode.v44.primary_assembly.annotation.mt.gtf

Histone.gtf (human)

less gencode.v44.primary_assembly.annotation.gtf | grep "gene_name \"H1" > gencode.v44.primary_assembly.annotation.histone.gtf;
less gencode.v44.primary_assembly.annotation.gtf | grep "gene_name \"H2" >> gencode.v44.primary_assembly.annotation.histone.gtf;
less gencode.v44.primary_assembly.annotation.gtf | grep "gene_name \"MACROH2" >> gencode.v44.primary_assembly.annotation.histone.gtf;
less gencode.v44.primary_assembly.annotation.gtf | grep "gene_name \"H3" >> gencode.v44.primary_assembly.annotation.histone.gtf;
less gencode.v44.primary_assembly.annotation.gtf | grep "gene_name \"CENPA" >> gencode.v44.primary_assembly.annotation.histone.gtf;
less gencode.v44.primary_assembly.annotation.gtf | grep "gene_name \"H4" >> gencode.v44.primary_assembly.annotation.histone.gtf

### test
python extract_id_symbol_fromGTF.py < gencode.v44.primary_assembly.annotation.histone.gtf > gencode.v44.primary_assembly.annotation.histone.txt

-> % less gencode.v41.primary_assembly.annotation.histone.txt  | wc -l
2530
-> % less gencode.v44.primary_assembly.annotation.histone.txt  | wc -l
2666

download gtf, add ERCC information, make mt.gtf (mouse)

wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M33/gencode.vM33.primary_assembly.annotation.gtf.gz .
gunzip gencode.vM33.primary_assembly.annotation.gtf.gz

cat gencode.vM33.primary_assembly.annotation.gtf ERCC92.gtf > gencode.vM33.primary_assembly.annotation.ERCC.gtf
less gencode.vM33.primary_assembly.annotation.gtf | grep chrM > gencode.vM33.primary_assembly.annotation.mt.gtf

Histone.gtf (mouse)

less gencode.vM33.primary_assembly.annotation.gtf | grep "gene_name \"H1" > gencode.vM33.primary_assembly.annotation.histone.gtf;
less gencode.vM33.primary_assembly.annotation.gtf | grep "gene_name \"Macroh2" >> gencode.vM33.primary_assembly.annotation.histone.gtf;
less gencode.vM33.primary_assembly.annotation.gtf | grep "gene_name \"H2a" >> gencode.vM33.primary_assembly.annotation.histone.gtf;
less gencode.vM33.primary_assembly.annotation.gtf | grep "gene_name \"H2b" >> gencode.vM33.primary_assembly.annotation.histone.gtf;
less gencode.vM33.primary_assembly.annotation.gtf | grep "gene_name \"H3" >> gencode.vM33.primary_assembly.annotation.histone.gtf;
less gencode.vM33.primary_assembly.annotation.gtf | grep "gene_name \"H4" >> gencode.vM33.primary_assembly.annotation.histone.gtf

### test
python extract_id_symbol_fromGTF.py < gencode.vM33.primary_assembly.annotation.histone.gtf > gencode.vM33.primary_assembly.annotation.histone.txt

-> % less gencode.vM30.primary_assembly.annotation.histone.txt  | wc -l
1422
-> % less gencode.vM33.primary_assembly.annotation.histone.txt  | wc -l
1451

create BED files

#human
docker run -it --rm -u 10010:10000 -v "$PWD":$PWD -w $PWD julia:0.6.4 julia ./ngs_misc/gtfToBed12.jl gencode.v44.primary_assembly.annotation.gtf gencode.v44.primary_assembly.annotation.bed

#mouse
docker run -it --rm -u 10010:10000 -v "$PWD":$PWD -w $PWD julia:0.6.4 julia ./ngs_misc/gtfToBed12.jl gencode.vM33.primary_assembly.annotation.gtf gencode.vM33.primary_assembly.annotation.bed

create RSEM index

#human
docker run --rm ramdaq:v1.1 rsem-prepare-reference --gtf ../gencode.v44.primary_assembly.annotation.gtf --bowtie2 --bowtie2-path /opt/conda/envs/ramdaq-1.0dev/bin/ -p 20 ../GRCh38.primary_assembly.genome.fa RSEM_bowtie2_index_gencode_v44

#mouse
docker run --rm ramdaq:v1.1 rsem-prepare-reference --gtf ../gencode.vM33.primary_assembly.annotation.gtf --bowtie2 --bowtie2-path /opt/conda/envs/ramdaq-1.0dev/bin/ -p 20 ../GRCm39.primary_assembly.genome.fa RSEM_bowtie2_index_gencode_vM33

tar -zcvf RSEM_bowtie2_index_gencode_v44.tar.gz RSEM_bowtie2_index_gencode_v44
tar -zcvf RSEM_bowtie2_index_gencode_vM33.tar.gz RSEM_bowtie2_index_gencode_vM33

add to local.config

    'GRCh38_v44' {
      adapter = "${params.local_annot_dir}/all_sequencing_WTA_adopters.fa"
      hisat2_idx = "${params.local_annot_dir}/hisat2_v220_index_GRCh38pri_ERCC/*.ht2"
      hisat2_rrna_idx = "${params.local_annot_dir}/hisat2_index_Human_rRNA_12_mito_5_45/*.ht2"
      chrsize = "${params.local_annot_dir}/GRCh38.primary_assembly.genome.fa.fai"
      bed = "${params.local_annot_dir}/gencode.v44.primary_assembly.annotation.bed"
      gtf = "${params.local_annot_dir}/gencode.v44.primary_assembly.annotation.ERCC.gtf"
      mt_gtf = "${params.local_annot_dir}/gencode.v44.primary_assembly.annotation.mt.gtf"
      rrna_gtf = "${params.local_annot_dir}/Human_GRCh38_rmsk_rRNA_RefSeq_RNA45SN1_5_merge.gtf"
      histone_gtf = "${params.local_annot_dir}/gencode.v44.primary_assembly.annotation.histone.gtf"
      hisat2_sirv_idx = "${params.local_annot_dir}/hisat2_v220_index_SIRVome/*.ht2"
      rsem_sirv_idx = "${params.local_annot_dir}/RSEM_bowtie2_index_SIRVome/*"
      rsem_allgene_idx = "${params.local_annot_dir}/RSEM_bowtie2_index_gencode_v44/*"
      nuclearRNA_qc = "human"
    }

    'GRCm39_vM33' {
      adapter = "${params.local_annot_dir}/all_sequencing_WTA_adopters.fa"
      hisat2_idx = "${params.local_annot_dir}/hisat2_v220_index_GRCm39_primary_ERCC/*.ht2"
      hisat2_rrna_idx = "${params.local_annot_dir}/hisat2_index_Mouse_rRNA_12_mito_5_45/*.ht2"
      chrsize = "${params.local_annot_dir}/GRCm39.primary_assembly.genome.fa.fai"
      bed = "${params.local_annot_dir}/gencode.vM33.primary_assembly.annotation.bed"
      gtf = "${params.local_annot_dir}/gencode.vM33.primary_assembly.annotation.ERCC.gtf"
      mt_gtf = "${params.local_annot_dir}/gencode.vM33.primary_assembly.annotation.mt.gtf"
      rrna_gtf = "${params.local_annot_dir}/Mouse_GRCm39_rmsk_rRNA_Refseq_45S_merge.gtf"
      histone_gtf = "${params.local_annot_dir}/gencode.vM33.primary_assembly.annotation.histone.gtf"
      hisat2_sirv_idx = "${params.local_annot_dir}/hisat2_v220_index_SIRVome/*.ht2"
      rsem_sirv_idx = "${params.local_annot_dir}/RSEM_bowtie2_index_SIRVome/*"
      rsem_allgene_idx = "${params.local_annot_dir}/RSEM_bowtie2_index_gencode_vM33/*"
      nuclearRNA_qc = "mouse"
    }
yfukai commented 6 months ago

Thank you for your informative answer @yuifu!