Closed animesh closed 3 years ago
Looks like it config.zip finally worked 2021-08-13T153310.143502.snakemake.log @acesnik 👍🏽
I am guessing that following are the results for the variant calling?
(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ ls -ltrh /home/animeshs/rnAGS/final/
total 1.1G
-rwxrwxrwx 1 animeshs animeshs 552M Aug 16 16:21 combined.spritz.snpeff.vcf
-rwxrwxrwx 1 animeshs animeshs 65M Aug 16 16:21 combined.spritz.snpeff.protein.fasta
-rwxrwxrwx 1 animeshs animeshs 150M Aug 16 16:21 combined.spritz.snpeff.protein.withdecoys.fasta
-rwxrwxrwx 1 animeshs animeshs 78M Aug 16 16:21 combined.spritz.snpeff.protein.withmods.xml.gz
-rwxrwxrwx 1 animeshs animeshs 34M Aug 16 16:21 Homo_sapiens.GRCh38.100.protein.fasta
-rwxrwxrwx 1 animeshs animeshs 79M Aug 16 16:21 Homo_sapiens.GRCh38.100.protein.withdecoys.fasta
-rwxrwxrwx 1 animeshs animeshs 76M Aug 16 16:21 Homo_sapiens.GRCh38.100.protein.withmods.xml.gz
If so, are the following numbers of protein sequences look reasonable to you?
(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ for i in /home/animeshs/rnAGS/final/*.fasta; do echo $i; grep "^>" $i | wc; done
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.fasta
63305 437182 6560454
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.withdecoys.fasta
152804 611216 14673896
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.fasta
73488 585678 29130709
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.withdecoys.fasta
176685 1069046 67683537
The with decoys numbers are confusing me though, like 152804 is off by 26194 if the reverse is included like 2*63305 or does it contain the variants too? The number difference is 29709 for the last two...
How are the variants being encoded in the fasta files? Is there a straightforward within Spritz way to summarize those?
These two files are actually the results of different programs, which should probably be clearer in the filenames. The first ones (Homo_sapiens.GRCh38.100.protein.fasta
, combined.spritz.snpeff.protein.fasta
) are produced by Spritz's fork of SnpEff that does variant annotation, and the second batch (Homo_sapiens.GRCh38.100.protein.withdecoys.fasta
, combined.spritz.snpeff.protein.withmods.xml.gz
, etc) is produced by SpritzModifications (formerly TransferUniProtModifications) that applies them to proteins. The reason for the discrepancy in counts is that SpritzModifications does limited combinatorial expansion of heterozygous variations, whereas SnpEff does not.
If you check grep "^>mz" $i | wc
for targets, versus grep "^>rev_mz" $i | wc
for decoys, you can verify that the number of decoys is the same as the number of targets (e.g., 76402 for both for Homo_sapiens.GRCh38.100.protein.withdecoys.fasta).
If you want to use the FASTA file with variants but without decoys, I would recommend selecting grep "^>mz" combined.spritz.snpeff.protein.withdecoys.fasta > combined.spritz.snpeff.protein.spritzmods.fasta
.
Great @acesnik , 152804/2=>76402 matches perfectly though 88340 and 88345 don't, does it mean there are only 5 variants?
(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ for i in /home/animeshs/rnAGS/final/*.fasta; do echo $i;grep "^>mz" $i | wc; done
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.fasta
63305 437182 6560454
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.withdecoys.fasta
76402 305608 7184144
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.fasta
73488 585678 29130709
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.withdecoys.fasta
88340 443908 33028098
(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ for i in /home/animeshs/rnAGS/final/*.fasta; do echo $i;grep "^>rev_mz" $i | wc; done
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.fasta
0 0 0
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.withdecoys.fasta
76402 305608 7489752
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.fasta
0 0 0
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.withdecoys.fasta
88345 625138 34655439
do you think the results are fine in general? Should i go forward with /home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.withdecoys.fasta* none the less?
Thanks, looks like you replied in between @acesnik . So i am guessing the variants are being encoded within the ^mz_* ? How to extract them and confirm, any way within Spritz?
Oh, that's interesting about the extra decoys. There is a process of reversing the variants in decoy generation, and that might change the length and thus the filtering based on having proteins >7 AAs. I'll look into that.
The decoy discrepancy is pretty small, so it might not have the biggest effect, but like I mentioned, you could select just the targets with grep "^>mz" combined.spritz.snpeff.protein.withdecoys.fasta > combined.spritz.snpeff.protein.spritzmods.fasta
.
All target proteins are encoded with "^>mz", and all decoys are encoded with "^>rev_mz". Variants will have the tag "variant:" in the header, so to check for the count of proteins with variants, I'd recommend doing grep "^>mz" $i | grep "variant:" | wc
.
In typical human experiments, I usually find about a quarter of the entries have variants.
Awesome @acesnik 👍🏽 Looks like there are about 15000 variants
(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ for i in /home/animeshs/rnAGS/final/*.fasta; do echo $i;grep "^>mz" $i | grep "variant:" | wc ; done
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.fasta
0 0 0
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.withdecoys.fasta
0 0 0
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.fasta
14695 179491 23040270
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.withdecoys.fasta
17324 159844 26350306
which is less than expected then? Probably the read depth was not enough, where to check for the mapping metrics and play with thresholds used for variant-calling? Is there some easy to configure within the config.yaml?
That's right around where I would expect. Probably 15-30% of entries or something. I wouldn't try changing any of the thresholding, personally.
It's actually probably more than 15000 variants, since multiple are encoded per protein. You can check on some other figures in the output of SpritzModifications at combined.spritz.snpeff.protein.withmods.log
. Right now, it has one summary for the targets only, and then another one for both targets and decoys (which you can ignore).
The first one also tells you the number of variant containing proteins (hopefully the same number as you saw above), the number of unique variants on those proteins, and the number of unique variants by type: missense, frameshift, deletion, etc.
You can also take a step back from there and look at the numbers of variants detected before applying them to proteins by opening combined.spritz.snpeff.html
, which is a report from SnpEff variant annotation.
The numbers seem to be matching withmods.log
Welcome to SpritzModifications!
Transfering modifications from UniProt database ../resources/uniprot/Homo_sapiens.protein.xml.gz to /home/animeshs/rnAGS/variants/combined.spritz.snpeff.protein.xml
76402 Canonincal proteins translated from gene model (without applied variations)
46883 Proteins with exact sequence match in UniProt
16422 Proteins without exact sequence match in UniProt
Analyzing resulting database /home/animeshs/rnAGS/variants/combined.spritz.snpeff.protein.withmods.xml
Spritz Database Summary
--------------------------------------------------------------
63305 Total number of canonical protein entries (before applying variations)
73491 Total number of protein entries
51237 Total modifications appended from UniProt out of 53436
10186 Total number of variant containing protein entries
13729 Total number of unique variants
231 Total number of unique synonymous variants
13498 Total number of unique nonsynonymous variants
12833 Number of unique SNV missense variants
188 Number of unique MNV missense variants
346 Number of unique frameshift variants
2 Number of unique insertion variants
21 Number of unique deletion variants
74 Number of unique stop gain variants
34 Number of unique stop loss variants
Spritz Database Summary
--------------------------------------------------------------
126610 Total number of canonical protein entries (before applying variations)
146983 Total number of protein entries
100982 Total modifications appended from UniProt out of 105326
20373 Total number of variant containing protein entries
27584 Total number of unique variants
474 Total number of unique synonymous variants
27110 Total number of unique nonsynonymous variants
25781 Number of unique SNV missense variants
378 Number of unique MNV missense variants
691 Number of unique frameshift variants
4 Number of unique insertion variants
40 Number of unique deletion variants
148 Number of unique stop gain variants
68 Number of unique stop loss variants
I will check the HTML too, thanks a lot @acesnik 👍🏽
Docker is running though... below is the full log