smith-chem-wisc / Spritz

Software for RNA-Seq analysis to create sample-specific proteoform databases from RNA-Seq data
https://smith-chem-wisc.github.io/Spritz/
MIT License

(1) "Error waiting for container: invalid character 'u' looking for beginning of value" (2) "Could not execute because the application was not found or a compatible .NET SDK is not installed." #212

Closed animesh closed 3 years ago

animesh commented 3 years ago

Docker is running though... below is the full log

Command executing: Powershell.exe docker pull smithlab/spritz ;docker run --rm -i -t --name spritz-956943642 -v """Z:\AGS\AGS RNAseq CNIO\raw data:/app/analysis""" -v """Z:\AGS\AGS RNAseq CNIO\raw data\data:/app/data""" -v """Z:\AGS\AGS RNAseq CNIO\raw data\configs:/app/configs""" smithlab/spritz; docker stop spritz-956943642
Saving output to Z:\AGS\AGS RNAseq CNIO\raw data\workflow_2021-03-23-10-45-25.txt. Please monitor it there...

Using default tag: latest
latest: Pulling from smithlab/spritz
Digest: sha256:55172c3a6e32257f977c9512e473f647eaeab32e35b6d96341598d6b96f97615
Status: Image is up to date for smithlab/spritz:latest
docker.io/smithlab/spritz:latest
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 12
Rules claiming more threads will be scaled down.
Conda environments: ignored
Job counts:
    count   jobs
    1   all
    1   base_recalibration
    1   build_transfer_mods
    1   call_gvcf_varaints
    1   call_vcf_variants
    1   download_snpeff
    1   final_vcf_naming
    1   finish_variants
    1   generate_reference_snpeff_database
    1   hisat2_groupmark_bam
    1   reference_protein_xml
    1   split_n_cigar_reads
    1   tmpdir
    1   transfer_modifications_variant
    1   variant_annotation_ref
    15

[Tue Mar 23 09:45:38 2021]
rule download_snpeff:
    output: SnpEff/snpEff.config, SnpEff/snpEff.jar, SnpEff_4.3_SmithChemWisc_v2.zip
    log: data/SnpEffInstall.log
    jobid: 5


[Tue Mar 23 09:45:38 2021]
rule build_transfer_mods:
    output: TransferUniProtModifications/TransferUniProtModifications/bin/Release/netcoreapp3.1/TransferUniProtModifications.dll
    log: data/TransferUniProtModifications.build.log
    jobid: 72
    benchmark: data/TransferUniProtModifications.build.benchmark


[Tue Mar 23 09:45:38 2021]
rule tmpdir:
    output: tmp, temporary
    log: data/tmpdir.log
    jobid: 68

Removing temporary output file temporary.
[Tue Mar 23 09:45:38 2021]
Finished job 68.
1 of 15 steps (7%) done

[Tue Mar 23 09:45:38 2021]
rule hisat2_groupmark_bam:
    input: analysis/align/combined.sorted.bam, tmp
    output: analysis/variants/combined.sorted.grouped.bam, analysis/variants/combined.sorted.grouped.bam.bai, analysis/variants/combined.sorted.grouped.marked.bam, analysis/variants/combined.sorted.grouped.marked.bam.bai, analysis/variants/combined.sorted.grouped.marked.metrics
    log: analysis/variants/combined.sorted.grouped.marked.log
    jobid: 16
    benchmark: analysis/variants/combined.sorted.grouped.marked.benchmark
    wildcards: dir=analysis
    resources: mem_mb=16000

[Tue Mar 23 09:45:50 2021]
Finished job 72.
2 of 15 steps (13%) done
Removing temporary output file SnpEff_4.3_SmithChemWisc_v2.zip.
[Tue Mar 23 09:46:31 2021]
Finished job 5.
3 of 15 steps (20%) done

[Tue Mar 23 09:46:31 2021]
rule generate_reference_snpeff_database:
    input: SnpEff/snpEff.jar, data/ensembl/Homo_sapiens.GRCh38.97.gff3, data/ensembl/Homo_sapiens.GRCh38.pep.all.fa, data/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa
    output: SnpEff/data/Homo_sapiens.GRCh38/protein.fa, SnpEff/data/Homo_sapiens.GRCh38/genes.gff, SnpEff/data/genomes/Homo_sapiens.GRCh38.fa, SnpEff/data/Homo_sapiens.GRCh38/doneHomo_sapiens.GRCh38.txt
    log: SnpEff/data/Homo_sapiens.GRCh38/snpeffdatabase.log
    jobid: 4
    benchmark: SnpEff/data/Homo_sapiens.GRCh38/snpeffdatabase.benchmark
    resources: mem_mb=16000

[Tue Mar 23 09:50:32 2021]
Finished job 4.
4 of 15 steps (27%) done

[Tue Mar 23 09:50:32 2021]
rule reference_protein_xml:
    input: SnpEff/data/Homo_sapiens.GRCh38/doneHomo_sapiens.GRCh38.txt, SnpEff/snpEff.jar, data/ensembl/Homo_sapiens.GRCh38.dna.primary_assembly.karyotypic.fa, TransferUniProtModifications/TransferUniProtModifications/bin/Release/netcoreapp3.1/TransferUniProtModifications.dll, data/uniprot/Homo_sapiens.protein.xml.gz
    output: analysis/variants/doneHomo_sapiens.GRCh38.97.txt, analysis/variants/Homo_sapiens.GRCh38.97.protein.xml, analysis/variants/Homo_sapiens.GRCh38.97.protein.xml.gz, analysis/variants/Homo_sapiens.GRCh38.97.protein.fasta, analysis/variants/Homo_sapiens.GRCh38.97.protein.withdecoys.fasta, analysis/variants/Homo_sapiens.GRCh38.97.protein.withmods.xml, analysis/variants/Homo_sapiens.GRCh38.97.protein.withmods.xml.gz
    log: analysis/variants/Homo_sapiens.GRCh38.97.spritz.log
    jobid: 74
    benchmark: analysis/variants/Homo_sapiens.GRCh38.97.spritz.benchmark
    wildcards: dir=analysis
    resources: mem_mb=16000

Removing temporary output file analysis/variants/Homo_sapiens.GRCh38.97.protein.xml.
Removing temporary output file analysis/variants/Homo_sapiens.GRCh38.97.protein.withmods.xml.
[Tue Mar 23 10:08:21 2021]
Finished job 74.
5 of 15 steps (33%) done
time="2021-03-23T15:01:58+01:00" level=error msg="error waiting for container: invalid character 'u' looking for beginning of value"
Done!
animesh commented 3 years ago

Looks like it finally worked (config.zip, 2021-08-13T153310.143502.snakemake.log), @acesnik 👍🏽

I am guessing that the following are the results of the variant calling?

(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ ls -ltrh /home/animeshs/rnAGS/final/
total 1.1G
-rwxrwxrwx 1 animeshs animeshs 552M Aug 16 16:21 combined.spritz.snpeff.vcf
-rwxrwxrwx 1 animeshs animeshs  65M Aug 16 16:21 combined.spritz.snpeff.protein.fasta
-rwxrwxrwx 1 animeshs animeshs 150M Aug 16 16:21 combined.spritz.snpeff.protein.withdecoys.fasta
-rwxrwxrwx 1 animeshs animeshs  78M Aug 16 16:21 combined.spritz.snpeff.protein.withmods.xml.gz
-rwxrwxrwx 1 animeshs animeshs  34M Aug 16 16:21 Homo_sapiens.GRCh38.100.protein.fasta
-rwxrwxrwx 1 animeshs animeshs  79M Aug 16 16:21 Homo_sapiens.GRCh38.100.protein.withdecoys.fasta
-rwxrwxrwx 1 animeshs animeshs  76M Aug 16 16:21 Homo_sapiens.GRCh38.100.protein.withmods.xml.gz

If so, do the following numbers of protein sequences look reasonable to you?

(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ for i in  /home/animeshs/rnAGS/final/*.fasta; do echo $i; grep "^>" $i | wc; done
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.fasta
  63305  437182 6560454
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.withdecoys.fasta
 152804  611216 14673896
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.fasta
  73488  585678 29130709
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.withdecoys.fasta
 176685 1069046 67683537

The with-decoys numbers are confusing me, though: 152804 is 26194 more than 2*63305 (= 126610), which is what I would expect if only the reversed sequences were added. Or does it contain the variants too? The corresponding difference is 29709 for the last two...

How are the variants encoded in the FASTA files? Is there a straightforward way within Spritz to summarize them?

acesnik commented 3 years ago

These two sets of files are actually the results of different programs, which should probably be clearer in the filenames. The first ones (Homo_sapiens.GRCh38.100.protein.fasta, combined.spritz.snpeff.protein.fasta) are produced by Spritz's fork of SnpEff, which does variant annotation, and the second batch (Homo_sapiens.GRCh38.100.protein.withdecoys.fasta, combined.spritz.snpeff.protein.withmods.xml.gz, etc.) is produced by SpritzModifications (formerly TransferUniProtModifications), which applies the variants to proteins. The reason for the discrepancy in counts is that SpritzModifications does limited combinatorial expansion of heterozygous variations, whereas SnpEff does not.

If you check grep "^>mz" $i | wc for targets, versus grep "^>rev_mz" $i | wc for decoys, you can verify that the number of decoys is the same as the number of targets (e.g., 76402 for both for Homo_sapiens.GRCh38.100.protein.withdecoys.fasta).
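
The check described above can be wrapped in a small function so that it prints both counts side by side. This is only a sketch: the function name `count_targets_decoys` is hypothetical, and it assumes, as stated in this thread, that target headers start with `>mz` and decoy headers with `>rev_mz`.

```shell
# Hypothetical helper (not part of Spritz): print the number of target and
# decoy headers in one FASTA file.
# Usage: count_targets_decoys FILE
count_targets_decoys() {
    local fasta="$1"
    local targets decoys
    # grep -c prints the number of matching lines. It exits with status 1 when
    # nothing matches, so "|| true" keeps a zero count from aborting "set -e" scripts.
    targets=$(grep -c '^>mz' "$fasta" || true)
    decoys=$(grep -c '^>rev_mz' "$fasta" || true)
    printf '%s\ttargets=%s\tdecoys=%s\n' "$fasta" "$targets" "$decoys"
}
```

Using `grep -c` instead of `grep | wc` gives a single number per pattern, which makes a target/decoy mismatch easier to spot.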

acesnik commented 3 years ago

If you want to use the FASTA file with variants but without decoys, I would recommend selecting grep "^>mz" combined.spritz.snpeff.protein.withdecoys.fasta > combined.spritz.snpeff.protein.spritzmods.fasta.
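
One caveat: `grep "^>mz"` selects only the matching header lines, so redirecting its output gives a file of headers without sequences. Below is a minimal sketch that keeps the sequence lines as well, using awk. The function name `extract_targets` is hypothetical, and the only assumption is the `>mz` target prefix described in this thread.

```shell
# Hypothetical helper (not part of Spritz): write complete target records
# (header plus all following sequence lines) to standard output.
# Usage: extract_targets IN.fasta > OUT.fasta
extract_targets() {
    # On every header line, decide whether the record is a target (">mz...").
    # The bare pattern "keep" then prints each line while the flag is set, so
    # decoy records (">rev_mz...") and their sequence lines are dropped.
    awk '/^>/ { keep = ($0 ~ /^>mz/) } keep' "$1"
}
```

For example, `extract_targets combined.spritz.snpeff.protein.withdecoys.fasta > combined.spritz.snpeff.protein.spritzmods.fasta` would write the target-only database under the output name suggested above.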

animesh commented 3 years ago

Great @acesnik, 152804/2 = 76402 matches perfectly, though 88340 and 88345 don't. Does that mean there are only 5 variants?

(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ for i in  /home/animeshs/rnAGS/final/*.fasta; do echo $i;grep "^>mz" $i | wc; done
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.fasta
  63305  437182 6560454
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.withdecoys.fasta
  76402  305608 7184144
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.fasta
  73488  585678 29130709
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.withdecoys.fasta
  88340  443908 33028098
(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ for i in  /home/animeshs/rnAGS/final/*.fasta; do echo $i;grep "^>rev_mz" $i | wc; done
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.fasta
      0       0       0
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.withdecoys.fasta
  76402  305608 7489752
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.fasta
      0       0       0
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.withdecoys.fasta
  88345  625138 34655439

Do you think the results are fine in general? Should I go forward with /home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.withdecoys.fasta* nonetheless?

animesh commented 3 years ago

Thanks, looks like you replied in between, @acesnik. So I am guessing the variants are encoded within the ^mz_* entries? How can I extract and confirm them? Is there a way within Spritz?

acesnik commented 3 years ago

Oh, that's interesting about the extra decoys. There is a process of reversing the variants in decoy generation, and that might change the length and thus the filtering based on having proteins >7 AAs. I'll look into that.

The decoy discrepancy is pretty small, so it might not have the biggest effect, but like I mentioned, you could select just the targets with grep "^>mz" combined.spritz.snpeff.protein.withdecoys.fasta > combined.spritz.snpeff.protein.spritzmods.fasta.

All target proteins are encoded with "^>mz", and all decoys are encoded with "^>rev_mz". Variants will have the tag "variant:" in the header, so to check for the count of proteins with variants, I'd recommend doing grep "^>mz" $i | grep "variant:" | wc.
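
As a sketch, the same count can be produced as a single number per file. The function name `count_variant_targets` is hypothetical; the `>mz` prefix and the `variant:` tag are taken from the description above.

```shell
# Hypothetical helper (not part of Spritz): count target entries whose header
# carries the "variant:" tag.
# Usage: count_variant_targets FILE
count_variant_targets() {
    # The first grep keeps target headers only (decoys start with ">rev_mz").
    # The second grep counts those that mention "variant:". "|| true" keeps a
    # zero count from being treated as a failure.
    grep '^>mz' "$1" | grep -c 'variant:' || true
}
```

Because the first filter is anchored on `^>mz`, decoy headers that also carry a `variant:` tag are not counted.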

acesnik commented 3 years ago

In typical human experiments, I usually find about a quarter of the entries have variants.

animesh commented 3 years ago

Awesome @acesnik 👍🏽 Looks like there are about 15000 variants

(spritzbase) animeshs@DMED7596:/mnt/z/Spritz/Spritz/workflow$ for i in  /home/animeshs/rnAGS/final/*.fasta; do echo $i;grep "^>mz" $i | grep "variant:" | wc ; done
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.fasta
      0       0       0
/home/animeshs/rnAGS/final/Homo_sapiens.GRCh38.100.protein.withdecoys.fasta
      0       0       0
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.fasta
  14695  179491 23040270
/home/animeshs/rnAGS/final/combined.spritz.snpeff.protein.withdecoys.fasta
  17324  159844 26350306

Which is less than expected, then? Probably the read depth was not enough. Where can I check the mapping metrics and adjust the thresholds used for variant calling? Is there something easy to configure in config.yaml?

acesnik commented 3 years ago

That's right around where I would expect. Probably 15-30% of entries or something. I wouldn't try changing any of the thresholding, personally.
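
To put numbers on that range, here is a small worked calculation using the counts posted earlier in this thread (variant-tagged target entries divided by all target entries). The function name `variant_fraction` is hypothetical.

```shell
# Hypothetical helper: percentage of entries that contain variants.
# Usage: variant_fraction VARIANT_ENTRIES TOTAL_ENTRIES
variant_fraction() {
    awk -v v="$1" -v t="$2" 'BEGIN { printf "%.1f%%\n", 100 * v / t }'
}

# combined.spritz.snpeff.protein.fasta: 14695 of 73488 target entries
variant_fraction 14695 73488   # prints 20.0%
# combined.spritz.snpeff.protein.withdecoys.fasta: 17324 of 88340 target entries
variant_fraction 17324 88340   # prints 19.6%
```

Both files come out at roughly 20% of entries, which falls inside the 15-30% range mentioned above.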

It's actually probably more than 15000 variants, since multiple are encoded per protein. You can check on some other figures in the output of SpritzModifications at combined.spritz.snpeff.protein.withmods.log. Right now, it has one summary for the targets only, and then another one for both targets and decoys (which you can ignore).

The first one also tells you the number of variant containing proteins (hopefully the same number as you saw above), the number of unique variants on those proteins, and the number of unique variants by type: missense, frameshift, deletion, etc.

acesnik commented 3 years ago

You can also take a step back from there and look at the numbers of variants detected before applying them to proteins by opening combined.spritz.snpeff.html, which is a report from SnpEff variant annotation.

animesh commented 3 years ago

The numbers seem to match withmods.log:

Welcome to SpritzModifications!
Transfering modifications from UniProt database ../resources/uniprot/Homo_sapiens.protein.xml.gz to /home/animeshs/rnAGS/variants/combined.spritz.snpeff.protein.xml
76402   Canonincal proteins translated from gene model (without applied variations)
46883   Proteins with exact sequence match in UniProt
16422   Proteins without exact sequence match in UniProt
Analyzing resulting database /home/animeshs/rnAGS/variants/combined.spritz.snpeff.protein.withmods.xml
Spritz Database Summary
--------------------------------------------------------------
63305   Total number of canonical protein entries (before applying variations)
73491   Total number of protein entries
51237   Total modifications appended from UniProt out of 53436
10186   Total number of variant containing protein entries
13729   Total number of unique variants
231     Total number of unique synonymous variants
13498   Total number of unique nonsynonymous variants
12833   Number of unique SNV missense variants
188     Number of unique MNV missense variants
346     Number of unique frameshift variants
2       Number of unique insertion variants
21      Number of unique deletion variants
74      Number of unique stop gain variants
34      Number of unique stop loss variants
Spritz Database Summary
--------------------------------------------------------------
126610  Total number of canonical protein entries (before applying variations)
146983  Total number of protein entries
100982  Total modifications appended from UniProt out of 105326
20373   Total number of variant containing protein entries
27584   Total number of unique variants
474     Total number of unique synonymous variants
27110   Total number of unique nonsynonymous variants
25781   Number of unique SNV missense variants
378     Number of unique MNV missense variants
691     Number of unique frameshift variants
4       Number of unique insertion variants
40      Number of unique deletion variants
148     Number of unique stop gain variants
68      Number of unique stop loss variants

I will check the HTML too, thanks a lot @acesnik 👍🏽