uclahs-cds / package-moPepGen

Multi-Omics Peptide Generator
https://uclahs-cds.github.io/package-moPepGen/
GNU General Public License v2.0
5 stars 1 forks source link

[minor] would be nice to have a util module for `extract gvf` #292

Closed lydiayliu closed 2 years ago

lydiayliu commented 2 years ago

trying to downsample and test circRNA + variants on a single transcript (the one that shows up after 3 hours)

did downsampling first

moPepGen-util downsampleReference \
    --tx-list ENST00000341594.9 \
    --output-dir /data/Index/ENST00000341594.9/ \
    --genome-fasta /reference/GRCh38.p13.genome.fa \
    --annotation-gtf /reference/gencode.v34.chr_patch_hapl_scaff.annotation.gtf \
    --proteome-fasta /reference/gencode.v34.pc_translations.fa \
    --translate-noncoding true

but of course inputting whole gvfs from circRNA + all variants will produce an error

a=/data/Parser/VEP/gencode/gsnp/CPCG0259.gencode.tsv.gvf
b=$(basename -- "$a"); echo ${b};
c="${b%%.*}"; echo ${c};
moPepGen callVariant \
    --input-variant /data/Parser/CIRCexplorer3/TOPHAT/${c}_IP_quant.txt.1.3ff.gvf \
        /data/Parser/VEP/gencode/gsnp/${b} \
        /data/Parser/VEP/gencode/gindel/${b} \
        /data/Parser/VEP/gencode/somaticsniper/${b} \
        /data/Parser/VEP/gencode/pindel/${b} \
    --genome-fasta /data/Index/ENST00000341594.9/genome.fasta \
    --annotation-gtf /data/Index/ENST00000341594.9/annotation.gtf \
    --proteome-fasta /data/Index/ENST00000341594.9/proteome.fasta \
    --output-fasta /data/Variant/CIRCexplorer3/TOPHAT/circ_ssm/${c}.ENST00000341594.9.3f.fasta

Attempting to use validateVariantCalling for the downsampling GVF part but that ran into a minor problem lol #291

a=/data/Parser/VEP/gencode/gsnp/CPCG0259.gencode.tsv.gvf
b=$(basename -- "$a"); echo ${b};
c="${b%%.*}"; echo ${c};
moPepGen-util validateVariantCalling \
    -i /data/Parser/CIRCexplorer3/TOPHAT/${c}_IP_quant.txt.1.3ff.gvf \
        /data/Parser/VEP/gencode/gsnp/${b} \
        /data/Parser/VEP/gencode/gindel/${b} \
        /data/Parser/VEP/gencode/somaticsniper/${b} \
        /data/Parser/VEP/gencode/pindel/${b} \
    -t ENST00000341594.9 \
    -o /data/Variant/CIRCexplorer3/TOPHAT/circ_ssm/CPCG0259_ENST00000341594.9 \
    --genome-fasta /data/Index/ENST00000341594.9/genome.fasta \
    --annotation-gtf /data/Index/ENST00000341594.9/annotation.gtf \
    --proteome-fasta /data/Index/ENST00000341594.9/proteome.fasta
zhuchcn commented 2 years ago

I was just doing something like this:

grep '^#' CPCG0259_circRNA.gvf > test.gvf
grep 'ENST00000341594.9' CPCG0259_circRNA.gvf >> test.gvf

But this could get tedious when you have multiple GVF files.

lydiayliu commented 2 years ago

cat /data/Parser/VEP/gencode/pindel/${b} | grep '#\|ENST00000341594.9' the OR \| function is handy!

yeah was slightly annoyed to have to do it 5 times XD