roland-rad-lab / MoCaSeq

Analysis pipelines for cancer genome sequencing in mice.

SnpEff Ensembl 102 database construction, VEP 102 installation, and separate VEP annotation by species to avoid human-only option issues #10

Open Just08 opened 2 years ago

Just08 commented 2 years ago

The new SnpEff 102 and VEP 102 annotations work. Note that my custom SnpEff database, built from Ensembl 102 data, does not contain the regulation and motif databases because of the issues described below (the motif part is commented out in my Dockerfile modification).

Just08 commented 2 years ago

For the motif part (commented out in my Dockerfile modification), 0 motifs are loaded:

[Optional] Reading motifs: GFF
#51 1605.7 00:02:30             Loading PWMs from : /opt/snpEff-4.3T/./data/GRCm38.102/pwms.bin
#51 1605.7 00:02:30             Loading motifs from : /opt/snpEff-4.3T/./data/GRCm38.102/motif.gff
#51 1633.2 00:02:58             Loadded motifs: 0
#51 1633.2 00:02:58             Saving motifs to: /opt/snpEff-4.3T/./data/GRCm38.102/motif.bin

For the regulation part, I tested:

gunzip ${PACKAGE_DIR}/snpEff-4.3T/data/GRCm38.102/*.gz && \
mkdir ${PACKAGE_DIR}/snpEff-4.3T/data/GRCm38.102/regulation.bed && \
wget -nv -r -np -nd -A "*.bed.gz" -e robots=off  http://ftp.ensembl.org/pub/release-102/regulation/mus_musculus/Peaks/ && \
ls *.bed.gz | awk -F"." -v mvCmd='mv "%s" "%s"\n' '{printf mvCmd,$0,"regulation."$3"."$4".bed.gz"}' | sh && \
mv regulation.*.bed.gz ${PACKAGE_DIR}/snpEff-4.3T/data/GRCm38.102/regulation.bed/ && \
gunzip ${PACKAGE_DIR}/snpEff-4.3T/data/GRCm38.102/regulation.bed/*.bed.gz

But I hit the same issue that was reported, without any solution, here: https://github.com/pcingola/SnpEff/issues/304

This is why I could not complete the "Building databases: Regulatory and Non-coding" part of the SnpEff documentation.

NikdAK commented 2 years ago

I can confirm the bug regarding the regulatory database build. Anyway, I found a workaround: convert the BED files to GFF. If the format looks like the line below, it just works. Only columns 1, 4, 5 and 9 need valid entries. Among the attributes, only Cell_type seems to be mandatory, but setting name, alias, etc. could be useful at some point.

chr1 source feature 4426826 4427337 . . . Cell_type=CHD2_CH12_LX__Enriched_Site

All BED files should be combined into a single GFF file, which can be gzipped (.gz) to save space.
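
The workaround above could be scripted roughly as follows. This is only a sketch under several assumptions that the thread does not confirm: the first three BED columns are chromosome, start and end; the Cell_type label is derived from the file name (assuming the regulation.<x>.<y>.bed naming produced by the rename step earlier in this thread); and "source" and "feature" are placeholder values copied from the example line. The function name bed_to_gff is hypothetical. Chromosome names are passed through unchanged, so they must already match the names used in the SnpEff genome.

```shell
#!/bin/sh
# Sketch: convert BED peak files into GFF lines carrying a Cell_type attribute.
# Assumptions: BED columns 1-3 are chrom, start, end; the Cell_type label is
# taken from the file name; "source" and "feature" are placeholders.

bed_to_gff() (
    # Usage: bed_to_gff FILE.bed [FILE.bed ...]   (GFF lines go to stdout)
    for bed in "$@"; do
        # regulation.CHD2.CH12.bed -> CHD2_CH12 (hypothetical naming scheme)
        label=$(basename "$bed" .bed)
        label=${label#regulation.}
        label=$(printf '%s' "$label" | tr '.' '_')
        awk -F'\t' -v OFS='\t' -v label="$label" '
            # Skip blank lines, comments, and track/browser header lines.
            /^$/ || /^#/ || /^track/ || /^browser/ { next }
            # BED starts are 0-based and GFF starts are 1-based, so add 1.
            # The end coordinate is the same in both formats.
            { print $1, "source", "feature", $2 + 1, $3, ".", ".", ".", "Cell_type=" label }
        ' "$bed"
    done
)
```

A possible invocation, run from the directory that holds the uncompressed BED files, would be `bed_to_gff regulation.*.bed | gzip > regulation.gff.gz`. The output file name is an assumption; put the file wherever the SnpEff build expects the regulation GFF for your genome version.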