Adapting workflow to other species

vmkalbskopf commented 3 years ago

This looks like a comprehensive workflow. I'd like to use it to do DE analysis and variant calling for other species, like non-human malaria. What would I need to do to make that happen?

sanjaynagi commented 3 years ago

Hi there! :)

Currently, one limitation of the pipeline is that it assumes genes are anchored to chromosomes. This is not a problem for DE and related analyses, but it becomes one for variant calling and the subsequent analyses, as the workflow splits the genome by chromosome for convenience. This would be fine for falciparum/vivax, but I assume your non-human malaria species does not have genes anchored to chromosomes. Is this the case?

However, the only part of the pipeline that truly requires genes->chromosomes is the windowed Fst and PBS analysis to detect selection. This could still be done solely at the gene level, so it would definitely be possible to alter the workflow to work for non-model organisms. I think it would take a decent amount of work to support the two options seamlessly (model vs. non-model), though. It's something that I have been thinking about implementing, but unfortunately I don't currently have the time to do so. If you know snakemake and want to give it a go, I'd be happy to help.

Cheers Sanj

vmkalbskopf commented 3 years ago

I'm not sure what you mean by 'genes anchored to a chromosome'. My species has a published, annotated genome which I map to. Is that sufficient?

sanjaynagi commented 3 years ago

Apologies - what I mean is that, for example, the Plasmodium falciparum 3D7 reference genome has 14 full chromosomes, almost all genes are assigned to a chromosome, and their specific position on the chromosome is known. In this workflow, currently, 14 separate VCFs would be produced, one per chromosome.

Alternatively, in less well-annotated organisms, instead of a few full-length chromosomes, the genome assembly might have hundreds to thousands of small scaffolds. Currently, that would be problematic for the workflow and for interpreting the results. I hope that helps.

What is your malaria species?

vmkalbskopf commented 3 years ago

Ah I see. I work on Plasmodium relictum, avian malaria. As you can see, I have 14 chromosomes and the mito and apicoplast chromosomes. Is there a way I could remove the 460 short scaffolds after mapping to simplify the representation? Or perhaps the short scaffolds could be concatenated into a meta-chromosome for simpler representation of results?

Anyway, I don't think that is the hardest part for me. The part I am struggling with is getting started. How should I remove the analyses from the workflow that I can't use, like the resistance genes, or account for the lack of a pre-existing SNP database?

EDIT: Ah I see there is an 'Activate' option in the example config file. Very helpful!

sanjaynagi commented 3 years ago

Ok, awesome! Well, certainly if you just list the 14 chromosomes + mito + apicoplast in the config file, for now, the other small scaffolds will be ignored for the purposes of the variant calling and extra variant analyses. Genes on the small scaffolds will still be used for things like differential expression, etc.
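As a rough illustration of the behaviour described above (a sketch, not the workflow's actual code), the snippet below splits a list of contig names into those named in a chromosome list and those left out. The function name `split_contigs`, the scaffold names and the "mito" label are hypothetical; only the `PRELSG_01_v1` naming pattern appears elsewhere in this thread.

```python
def split_contigs(all_contigs, chromosomes):
    """Return (kept, ignored): contigs named in `chromosomes`, and the rest."""
    wanted = set(chromosomes)
    kept = [c for c in all_contigs if c in wanted]
    ignored = [c for c in all_contigs if c not in wanted]
    return kept, ignored


# Hypothetical contig names, for illustration only.
all_contigs = ["PRELSG_01_v1", "PRELSG_02_v1", "mito", "scaffold_0001", "scaffold_0002"]
chromosomes = ["PRELSG_01_v1", "PRELSG_02_v1", "mito"]

kept, ignored = split_contigs(all_contigs, chromosomes)
print("used for variant calling:", kept)
print("ignored for variant calling:", ignored)
```

Anything not listed simply drops out of the per-chromosome steps, which is why no post-mapping removal of the short scaffolds should be needed.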

Apologies - I am in the process of writing a README in the config/ folder, with documentation on how to configure the workflow (currently there is not a lot of documentation, so I'm not surprised if it is confusing!). This is a priority, so I'm hoping to get it done in the next week or so. I'm also writing the manuscript for the workflow, which may also aid interpretation. Anyway, if you want to give it a go, I'll try and give you a hand where there are issues! :)

vmkalbskopf commented 3 years ago

Thank you so much! I am definitely going to try.

vmkalbskopf commented 3 years ago

When I try to create a template clone of the directory using the GitHub CLI app, I get this error: GraphQL error: Could not clone: sanjaynagi/rna-seq-ir is not a template repository. I think you need to mark this repo as a template repo by doing this.

Btw, the link you have in the README about cloning/forking the directory as a template does not help. GitHub's UI has changed, or perhaps it's because you have not made this repo a template repo.

sanjaynagi commented 3 years ago

Thank you, I've made it a template now, but would recommend maybe just doing a git clone --recursive for now. I'm not fully sure of the purpose of the templating, this was residual from the snakemake cookie-cutter template.

I'll edit the readme accordingly.

vmkalbskopf commented 3 years ago

OK, I've hit my first speed bump. I've adapted the config file, all the resources, and the sample name and gene name files. Then I tried the dry run and got this error.

snakemake --use-conda -n
---------------------------- RNA-Seq-IR ----------------------------
Running RNA-Seq-IR snakemake workflow in /scratch/victor/rna-seq-pr/workflow

Author:   Sanjay Curtis Nagi
Workflow Version: v0.4.0
Execution time:  2021-05-27 14:27:38.687554
Dataset: six_nonsiskins

AttributeError in line 40 of /scratch/victor/rna-seq-pr/workflow/rules/variantAnalysis.smk:
'list' object has no attribute 'Name'
  File "/scratch/victor/rna-seq-pr/workflow/Snakefile", line 32, in <module>
  File "/scratch/victor/rna-seq-pr/workflow/rules/variantAnalysis.smk", line 40, in <module>

This is line 40 of variantAnalysis.smk: mut=mutationData.Name,

Here is the config file. Ignore the .txt at the end. Just needed that to upload to github. config.yaml.txt

sanjaynagi commented 3 years ago

Ok - this was a bug: the rule mpileupIR still needed a mutations file even when IRmutations is inactivated in the config. Strange, as snakemake shouldn't really need to look for that when the rule isn't going to be run. I've just pushed a temporary fix to GitHub which reads in the exampleMutations.tsv file, if you want to do a git pull! :)

vmkalbskopf commented 3 years ago

Thanks!

vmkalbskopf commented 3 years ago

I'm not a frequent git user, so I'm probably missing something obvious. First I ran git clone --recursive https://github.com/sanjaynagi/rna-seq-ir.git, then git pull, but it says "Already up to date." How do I pull the latest version?

sanjaynagi commented 3 years ago

Is this still the case? Sorry, I find it difficult to check these things, as being the repo owner gives me more permissions...

If you run grep example workflow/Snakefile, the result should be mutationData = pd.read_csv("resources/exampleMutations.tsv", sep="\t") if the fix has been incorporated.

Otherwise, does completing the steps in section 6 of the README help at all? Thanks, I appreciate your help in testing this!

vmkalbskopf commented 3 years ago

It's still the old Snakefile file. This is what happened when trying to follow section 6


git remote add -f upstream https://github.com/snakemake-workflows/rna-seq-ir.git
fatal: remote upstream already exists.

git fetch upstream
ERROR: Repository not found.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

I feel really stupid..

Of course I can manually make the change to the Snakefile, but that is a bit of a hack.

sanjaynagi commented 3 years ago

Don't feel stupid!! :) It's my fault that I haven't sorted this yet; you're my first user of the workflow outside of my department, that's all. I'm going to try and use a different machine without my GitHub SSH setup so I can test these steps.

In the meantime, could you make a backup of your config.yaml, samples.tsv, and anything extra you've put in the repo (reference files etc.), then remove the folder and do a fresh git clone --recursive? Hopefully it'll work then.

sanjaynagi commented 3 years ago

Out of interest, did you originally do a git clone --recursive to get the repo, or did you do the template thing?

sanjaynagi commented 3 years ago

You did the template thing, I see. I've just been attempting to do the same and I'm struggling to get it working. I'd recommend just doing git clone --recursive on rna-seq-ir (or forking it), and I'm going to remove all the templating advice from the README.

vmkalbskopf commented 3 years ago

> Out of interest, did you originally do a git clone --recursive to get the repo, or did you do the template thing?

At first I did a template, but since I complained about the upstream issue, I've done the git clone --recursive.

Snakemake is running now. But I'm getting this error. Perhaps you can make it optional to use example mutations, since I don't have any pre-existing snp database.

snakemake -n --use-conda
FileNotFoundError in line 24 of /scratch/victor/rna-seq-ir/workflow/Snakefile:
[Errno 2] No such file or directory: 'resources/exampleMutations.tsv'
  File "/scratch/victor/rna-seq-ir/workflow/Snakefile", line 24, in <module>
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/parsers.py", line 610, in read_csv
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/parsers.py", line 462, in _read
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/parsers.py", line 819, in __init__
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/parsers.py", line 1050, in _make_engine
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/parsers.py", line 1867, in __init__
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/parsers.py", line 1362, in _open_handles
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/pandas/io/common.py", line 642, in get_handle

sanjaynagi commented 3 years ago

I will do. For now, can you add this into the resources folder (remove the .txt)? It won't be used. exampleMutations.tsv.txt

vmkalbskopf commented 3 years ago

That did not fix it. I still get the same error. I also tried commenting out these 2 lines in the Snakefile:

else:
    mutationData = pd.read_csv("resources/exampleMutations.tsv", sep="\t")

but then I get this error, because some rules still need mutation data.

NameError in line 40 of /scratch/victor/rna-seq-ir/workflow/rules/variantAnalysis.smk:
name 'mutationData' is not defined
  File "/scratch/victor/rna-seq-ir/workflow/Snakefile", line 31, in <module>
  File "/scratch/victor/rna-seq-ir/workflow/rules/variantAnalysis.smk", line 40, in <module>

sanjaynagi commented 3 years ago

I've just added a fix which removes the need for exampleMutations and makes a dummy dataframe instead. Can you do a git pull and try again? :)

vmkalbskopf commented 3 years ago

The dry run completed successfully! Thank you for your help Sanjay.

vmkalbskopf commented 3 years ago

I found a Plasmodium relictum snpEff database called Plasmodium_relictum_gca_900005765

I can run snpEff download Plasmodium_relictum_gca_900005765 without error (though I don't know where the file goes). However, when I try to run the pipeline, the snpEff database download rule fails. This is the failing code:

snpEff download Plasmodium_relictum_gca_900005765 2> logs/snpEff/snpEffDbDownload.log

This is the log:

java.lang.RuntimeException: Property: 'Plasmodium_relictum_gca_900005765.genome' not found
  at org.snpeff.interval.Genome.<init>(Genome.java:106)
  at org.snpeff.snpEffect.Config.readGenomeConfig(Config.java:681)
  at org.snpeff.snpEffect.Config.readConfig(Config.java:649)
  at org.snpeff.snpEffect.Config.init(Config.java:480)
  at org.snpeff.snpEffect.Config.<init>(Config.java:117)
  at org.snpeff.SnpEff.loadConfig(SnpEff.java:451)
  at org.snpeff.snpEffect.commandLine.SnpEffCmdDownload.runDownloadGenome(SnpEffCmdDownload.java:80)
  at org.snpeff.snpEffect.commandLine.SnpEffCmdDownload.run(SnpEffCmdDownload.java:72)
  at org.snpeff.SnpEff.run(SnpEff.java:1183)
  at org.snpeff.SnpEff.main(SnpEff.java:162)

Do you know why the ".download" is added at the end of the name? Cause it seems that is the cause of the problem.

sanjaynagi commented 3 years ago

What did you put in the config.yaml under snpeff? I think it might be just plasmodium_relictum as opposed to the full name, but I could be wrong here. What does it show if you do snpEff databases | grep 'plasmodium_relictum' (or whatever pattern finds relictum)? I’m on a train without a laptop, otherwise I would have looked myself.

I think the snpEff version is the latest version (5.0); however, there was a known bug in 4.5 in which it couldn’t find the Aedes aegypti genome.

vmkalbskopf commented 3 years ago

Plasmodium_relictum_gca_900005765 is the record in the database when I grep for Plasmodium_relictum. And Plasmodium_relictum_gca_900005765 is what I put in the config file. Conda is installing version SnpEff 4.3t 2017-11-24 into the environment. Do you know why that might be?

I'll try specifying the latest version in the env file.

vmkalbskopf commented 3 years ago

OK! I think upgrading to a new version fixed that. New error.

Error.  nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)
Error.  nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)
[Fri May 28 17:49:48 2021]
Error in rule Ag1000gSweepsDE:
    jobid: 347
    output: results/genediff/ag1000gSweeps/cond1_cond2_swept.tsv
    log: logs/Ag1000gSweepsDE.log (check log file(s) for error message)
    conda-env: /scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930

I think those are two separate errors, because when I look at that log file, I get something that is unrelated to the number of threads (btw, I am running this on a system with 256 threads, and I'm specifying 100 jobs at a time for snakemake). Here is the log file:

Traceback (most recent call last):
  File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmpi1y8xn68.Ag1000gSweepsDE.py", line 24, in <module>
    signals = pd.read_csv("resources/signals.csv")
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/parsers.py", line 605, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/parsers.py", line 457, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/parsers.py", line 814, in __init__
    self._engine = self._make_engine(self.engine)
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/parsers.py", line 1045, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/parsers.py", line 1862, in __init__
    self._open_handles(src, kwds)
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/parsers.py", line 1363, in _open_handles
    storage_options=kwds.get("storage_options", None),
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/io/common.py", line 647, in get_handle
    newline="",
FileNotFoundError: [Errno 2] No such file or directory: 'resources/signals.csv'

There is indeed no signals.csv file there.

EDIT: I'm stupid. I forgot to disable the Ag1000g sweep in the config file.

vmkalbskopf commented 3 years ago

Still getting the error about the number of threads even when running with 63 threads.

sanjaynagi commented 3 years ago

Ok, it should be version 5.0... that must be why. And try with fewer threads perhaps. I probably won’t be responsive until Tuesday now as it’s a holiday over here, but good luck!

vmkalbskopf commented 3 years ago

> Ok, it should be version 5.0... that must be why. And try with fewer threads perhaps. I probably won’t be responsive until Tuesday now as it’s a holiday over here, but good luck!

No worries. Enjoy the wonderful weather (I hope)!

I have restricted it to running with one core. Now it stops with the same error consistently:

Error.  nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)
-------------- Reading VCF for chromosome PRELSG_01_v1 --------------
------- Filtering VCF at QUAL=30 and missingness proportion of 0.8 -------
[Sat May 29 17:24:05 2021]
Error in rule WindowedFstPBS:
*I've removed the details about the job specs, jumping straight to the error*

log: logs/WindowedFstPCA.log (check log file(s) for error message)
    conda-env: /scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930

RuleException:
CalledProcessError in line 204 of /scratch/victor/rna-seq-ir/workflow/rules/variantAnalysis.smk:
Command 'source /home/victor/anaconda3/bin/activate '/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930'; set -euo pipefail;  python /scratch/victor/rna-seq-ir/.snakemake/scripts/tmp71yc1h33.WindowedFstPBS.py' returned non-zero exit status 1.
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2349, in run_wrapper
  File "/scratch/victor/rna-seq-ir/workflow/rules/variantAnalysis.smk", line 204, in __rule_WindowedFstPBS
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 569, in _callback
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/concurrent/futures/thread.py", line 52, in run
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2381, in run_wrapper
Shutting down, this might take some time.

Here is the log mentioned in the error:

Traceback (most recent call last):
  File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmp71yc1h33.WindowedFstPBS.py", line 45, in <module>
    missingfltprop=missingprop)
  File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 115, in readAndFilterVcf
    geno = allel.GenotypeArray(vcf['calldata/GT'].compress(passfilter, axis=0))
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/allel/model/ndarray.py", line 1476, in __init__
    check_ndim(self.values, 3)
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/allel/util.py", line 66, in check_ndim
    raise TypeError('bad number of dimensions: expected %s; found %s' % (ndim, a.ndim))
TypeError: bad number of dimensions: expected 3; found 2

I'm wondering if there is something amiss with the VCF file, causing scikit-allel to raise that error (bad number of dimensions: expected 3; found 2) as part of WindowedFstPBS.

sanjaynagi commented 3 years ago

Hey Victor,

I realised that the reason snpEff was v4.3.1 is that I had changed it to 5.0.0, but then the latest conda version of snpSift is 4.3.1, and it complained of being incompatible (snpSift is used later for the differential SNPs module, which AFAIK is not working right now and which I need to fix). I might try to remove the snpSift rule, as it's just filtering and I assume bcftools will be able to do the same.

Re the latest error: what level of ploidy did you use in the config? If 1, this could be why, although allel.GenotypeArray() should work with variable ploidy. Are these pooled samples? I guess with Plasmodium you need to pool lots of parasites for RNA-Seq, is that right?

sanjaynagi commented 3 years ago

I was incorrect: GenotypeArray needs a ploidy above 1 (I didn't know about this), and I should instead use a HaplotypeArray. I will implement this as it's important, but I do think that if you are using pooled data it might actually make more sense to use a higher ploidy, as allele frequencies in the sample will be captured more accurately than if we try to force it to call one allele only.
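To make the dimension mismatch concrete, here is a minimal pure-Python sketch (no scikit-allel involved, and not the workflow's code). Haploid calls hold one allele per sample, so they form a 2-D variants x samples table, whereas a diploid-style genotype table has a third ploidy axis, which is consistent with the "expected 3; found 2" error. The helper names and the toy calls are made up for illustration, and -1 is assumed as the missing-call marker.

```python
from collections import Counter

MISSING = -1  # assumed missing-call marker for this sketch


def ndim(nested):
    """Count the nesting depth of a regular, non-empty nested list."""
    depth = 0
    while isinstance(nested, list):
        depth += 1
        nested = nested[0]
    return depth


def count_alleles_haploid(calls, n_alleles=2):
    """Per-variant allele counts for 2-D haploid calls, skipping missing calls."""
    counts = []
    for variant in calls:
        tally = Counter(a for a in variant if a != MISSING)
        counts.append([tally.get(i, 0) for i in range(n_alleles)])
    return counts


# Toy data: 3 variants x 4 samples, one allele per sample.
haploid_calls = [[0, 1, 1, 0], [1, 1, MISSING, 1], [0, 0, 0, 0]]
# With ploidy 2 each sample holds two alleles, adding a third axis.
diploid_calls = [[[0, 0], [0, 1]], [[1, 1], [0, 1]]]

print(ndim(haploid_calls), ndim(diploid_calls))
print(count_alleles_haploid(haploid_calls))
```

Counting alleles directly on the 2-D table is roughly what a haplotype-style array is for, so no artificial ploidy axis is needed.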

vmkalbskopf commented 3 years ago

Hi Sanjay

I did indeed specify a ploidy of one. The samples are not pooled; instead we sequence each sample very deeply (50x-80x). They are blood samples from infected hosts. Each host will have a multitude of parasites that get sequenced, and this is RNA-seq, so ploidy is tricky. Not sure what ploidy to use.

I've rerun with a ploidy of 5 (why not). The filtering is working now, but I'm still getting the error about the number of threads: Error. nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64). I ran it with 1 CPU.

sanjaynagi commented 3 years ago

OK, in that case we can probably assume they are (relatively) clonal, and therefore I do agree that a ploidy of 1 makes the most sense. You'll have to give me a couple of days to implement it.

Looks as though the nthreads issue is related to this https://github.com/cggh/scikit-allel/issues/285 . I haven't run before on a machine with more than 64 threads so haven't come across it.

Supposedly, adding this to the top of the Python script can fix it. Could you try manually editing the workflow/scripts/tools.py script to add the following directly at the top? If it works, I could add it to the workflow. I think you should change 272 to however many cores you have, but even if you don't it should be OK.

import os
os.environ["NUMEXPR_MAX_THREADS"] = "272"
import allel
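A variation on the same workaround, as a sketch: rather than hard-coding 272, the limit can be taken from the machine's core count. This assumes, as in the linked scikit-allel issue, that the variable has to be set before the library that reads it is imported. `os.cpu_count()` is standard library; the fallback of 1 is an arbitrary choice.

```python
import os

# Use the machine's logical core count; fall back to 1 if it cannot be determined.
n_cores = os.cpu_count() or 1

# Must run before importing allel (or anything else that imports numexpr).
os.environ["NUMEXPR_MAX_THREADS"] = str(n_cores)

# import allel  # left commented out so the sketch runs without scikit-allel installed

print(os.environ["NUMEXPR_MAX_THREADS"])
```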

vmkalbskopf commented 3 years ago

Specifying the environmental core count did the trick!

Now onto the next hiccup.

Traceback (most recent call last):
  File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmp6xi5rrhl.WindowedFstPBS.py", line 45, in <module>
    missingfltprop=missingprop)
  File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 123, in readAndFilterVcf
    ac = geno.count_alleles()
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/allel/model/ndarray.py", line 1839, in count_alleles
    max_allele = self.max()
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/numpy/core/_methods.py", line 39, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity

I'm going to rerun all the variant analysis steps, in case there are issues with uncompleted jobs that aren't cleaned up correctly. If I say nothing after this, it didn't help.

sanjaynagi commented 3 years ago

Yo. I've edited the pipeline and it should now work with haploids :)

I don't know if you'd prefer to try that! Otherwise I'll need to get back to you on the other issue.

vmkalbskopf commented 3 years ago

> Yo. I've edited the pipeline and it should now work with haploids :)
>
> I don't know if you'd prefer to try that! Otherwise I'll need to get back to you on the other issue.

Working on haploids is great! I'll pull that change :-)

vmkalbskopf commented 3 years ago

I've discovered that the snpEff Plasmodium relictum database is (unusably) out of date. I am building a new one based on the latest genome and annotations. But snpEff stores its databases in its own working directory, which gets regenerated each time the conda environment is created. This makes it hard to specify a custom snpEff database on the fly, as a parameter in a Snakemake rule.

Of course, one could simply add a rule that builds the database based on params from the config file. Would you like me to try my hand at that and push it to you? Should this be its own issue?

sanjaynagi commented 3 years ago

Yeah sure :) would you mind making a new issue re allowing custom snpEff databases and then have a go? Thanks

vmkalbskopf commented 3 years ago

Latest error when running StatisticsAndPCA.py:

Traceback (most recent call last):
  File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmpjm6zc5wp.StatisticsAndPCA.py", line 59, in <module>
    missingfltprop=missingprop)
  File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 128, in readAndFilterVcf
    ac = geno.count_alleles()
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/allel/model/ndarray.py", line 2394, in count_alleles
    max_allele = self.max()
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/numpy/core/_methods.py", line 39, in _amax
    return umr_maximum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation maximum which has no identity

vmkalbskopf commented 3 years ago

I'm suddenly running into a new error that has not occurred before, which doesn't make sense, because these rules used to work. It applies to both DifferentialIsoformExpression and DifferentialGeneExpression.

Error in rule DifferentialIsoformExpression:
    jobid: 228
    output: results/isoformdiff/cond1_cond2.csv, results/isoformdiff/six_nonsiskins_isoformdiffexp.xlsx
    log: logs/DifferentialIsoformExpression.log (check log file(s) for error message)
    conda-env: /scratch/victor/rna-seq-ir/.snakemake/conda/7a5c81fd8601368f4ba91176197ab591

RuleException:
CalledProcessError in line 100 of /scratch/victor/rna-seq-ir/workflow/rules/diff.smk:
Command 'source /home/victor/anaconda3/bin/activate '/scratch/victor/rna-seq-ir/.snakemake/conda/7a5c81fd8601368f4ba91176197ab591'; set -euo pipefail;  Rscript --vanilla /scratch/victor/rna-seq-ir/.snakemake/scripts/tmpnn9aq7u2.SleuthIsoformsDE.R' returned non-zero exit status 1.
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2349, in run_wrapper
  File "/scratch/victor/rna-seq-ir/workflow/rules/diff.smk", line 100, in __rule_DifferentialIsoformExpression
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 569, in _callback
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/concurrent/futures/thread.py", line 52, in run
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
  File "/home/victor/anaconda3/envs/sn_env/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 2381, in run_wrapper

Here is the relevant part of the log:

Joining, by = "GeneID"
Error: Problem with `mutate()` column `Gene_name`.
ℹ `Gene_name = case_when(...)`.
✖ must be a character vector, not a logical vector.
Backtrace:

This script has quite a few input files:

config/samples.tsv, resources/GeneNames.tsv, results/quant/R47, results/quant/R48, results/quant/R49, results/quant/R50, results/quant/R51, results/quant/R52

Here's the samples file:

samples treatment       species strain
R47     cond1   prelictum       SGS1
R48     cond1   prelictum       SGS1
R49     cond1   prelictum       SGS1
R50     cond2   prelictum       SGS1
R51     cond2   prelictum       SGS1
R52     cond2   prelictum       SGS1

and part of GeneNames.tsv file

Gene_stable_ID  Gene_name       Gene_description
PRELSG_03_v1    ""      conserved Plasmodium protein%2C unknown function
PRELSG_14_v1    ""      conserved Plasmodium protein%2C unknown function
PRELSG_08_v1    ""      conserved Plasmodium protein%2C unknown function
PRELSG_02_v1    ""      transcription initiation factor TFB%2C putative
PRELSG_12_v1    ""      GTP-binding protein%2C putative

Here's a quant file abundance.tsv:

target_id       length  eff_length      est_counts      tpm
PRELSG_0100100.1        1764    1568.65 0       0
PRELSG_0100200.1        73      29.154  0       0
PRELSG_0100300.1        84      32.6917 0       0
PRELSG_0100400.1        312     119.706 0       0
PRELSG_0100500.1        5073    4877.65 0       0
PRELSG_0100600.1        534     338.669 0       0
PRELSG_0100700.1        3777    3581.65 0       0
PRELSG_0100800.1        420     224.851 0       0
PRELSG_0100900.1        2061    1865.65 0       0
PRELSG_0101000.1        6195    5999.65 14      3.04701
PRELSG_0101100.1        786     590.645 24      53.0586

sanjaynagi commented 3 years ago

The last error should now be fixed, I think; the case_when statement was not evaluating quite right. Try a git pull :)

FYI, I've renamed the 'samples' column to 'sampleID' in the samples.tsv file, so that will need changing.

Could you also try a much lower missingness (under pbs: missingness: in the config file; I probably need to change that as it's confusing) and see if StatisticsAndPCA works then? Thanks.
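For what it's worth, the "zero-size array" error is what you would expect if the filter left no variants at all, so a guard along these lines might give a clearer message. This is only a sketch: it assumes the missingness setting means the minimum fraction of samples with a non-missing call, which may not match what tools.py does, and the function name and toy data are invented.

```python
MISSING = -1  # assumed missing-call marker


def filter_by_missingness(calls, min_called_fraction):
    """Keep variants where the fraction of non-missing calls meets the threshold."""
    kept = []
    for variant in calls:
        called = sum(1 for a in variant if a != MISSING)
        if called / len(variant) >= min_called_fraction:
            kept.append(variant)
    if not kept:
        raise ValueError(
            f"no variants left after filtering at {min_called_fraction}; "
            "try a lower threshold"
        )
    return kept


# Toy data: 3 variants x 4 samples, each with at least one missing call.
calls = [[0, MISSING, 1, MISSING], [MISSING, MISSING, 1, 0], [0, 1, MISSING, 1]]

print(len(filter_by_missingness(calls, 0.5)))  # all three variants pass at 0.5

try:
    filter_by_missingness(calls, 0.8)  # called fractions are 0.5, 0.5 and 0.75
except ValueError as err:
    print("filter failed:", err)
```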

vmkalbskopf commented 3 years ago

Thanks!

The differential expression is still giving the same error:

Joining, by = "sampleID"
 --- Running DESeq2 differential expression analysis on cond1_cond2 ---
Joining, by = "GeneID"
Joining, by = "GeneID"
Error: Problem with `mutate()` column `Gene_name`.
ℹ `Gene_name = case_when(...)`.
✖ must be a character vector, not a logical vector.

As far as I can tell, I am using the latest version, as the modified time stamps on the rule files are from today.

sanjaynagi commented 3 years ago

Morning Victor. Could you send over an abundance.tsv file and also your full gene_names.tsv file? I need to fix something relating to mapping transcripts to genes.

vmkalbskopf commented 3 years ago

gene_names.tsv.txt abundance.tsv.txt

sanjaynagi commented 3 years ago

Thanks Victor. I've edited the workflow - previously it was only suited to using geneIDs from VectorBase.

It now needs the gene names file in a slightly different format which maps genes to transcripts: tab-separated, 4 columns (GeneID, TranscriptID, GeneName, GeneDescription), although the description is optional. In the config file it is now called genes2transcripts instead of gene_names. This will be in the exampleconfig that gets pulled, and there is an example Gene2TranscriptMap.tsv.

This should fix it, but let me know if it doesn't.

Edit: I just had a look at your gene_names file and there are no column names, which would mean it definitely wouldn't work anyway. But it needed the above change in any case.
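To check a file against the layout described above before running the workflow, something like the following could help. It is a sketch only: the column names are taken from the description in this comment, while the check itself and the function name `check_gene2transcript` are hypothetical and not part of the workflow. The example rows reuse transcript IDs from this thread, and pairing them with gene IDs by dropping the ".1" suffix is an assumption.

```python
import csv
import io

# Column names as described in the comment above; GeneDescription is optional.
REQUIRED = ["GeneID", "TranscriptID", "GeneName"]


def check_gene2transcript(handle):
    """Read a tab-separated gene-to-transcript table and return its rows as dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED if col not in header]
    if missing:
        raise ValueError(f"missing column(s): {', '.join(missing)}")
    rows = list(reader)
    for line_no, row in enumerate(rows, start=2):
        if not row["GeneID"] or not row["TranscriptID"]:
            raise ValueError(f"line {line_no}: GeneID and TranscriptID must not be empty")
    return rows


# Hypothetical example content; gene names are left empty here.
example = (
    "GeneID\tTranscriptID\tGeneName\tGeneDescription\n"
    "PRELSG_0100100\tPRELSG_0100100.1\t\tconserved Plasmodium protein\n"
    "PRELSG_0100200\tPRELSG_0100200.1\t\t\n"
)

rows = check_gene2transcript(io.StringIO(example))
print(len(rows), "rows OK")
```

A file with no header row fails the column check straight away, because its first data line gets treated as the header.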

vmkalbskopf commented 3 years ago

OK, I actually sent you the wrong gene names file. Mine is called GeneNames.tsv. It has the header which gene_names.tsv was missing.

Gene_stable_ID  Gene_name       Gene_description
PRELSG_03_v1    ""      conserved Plasmodium protein%2C unknown function
PRELSG_14_v1    ""      conserved Plasmodium protein%2C unknown function
PRELSG_08_v1    ""      conserved Plasmodium protein%2C unknown function

But this is irrelevant now anyway. My genes don't have canonical names, or at least not ones I can quickly find, so I assumed that the "" would be enough to specify an empty value.

I'll attempt the new version.

vmkalbskopf commented 3 years ago

I don't think you've pushed your changes to the repo.

sanjaynagi commented 3 years ago

Whoops, good point! Done now. Yeah, empty gene names are fine :)

vmkalbskopf commented 3 years ago

When executing Differential expression rule:


------------- Kallisto - DESeq2 - RNASeq Differential expression ---------
Joining, by = "TranscriptID"
Error in `.rowNamesDF<-`(x, value = value) :
  missing values in 'row.names' are not allowed
Calls: %>% ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
Execution halted

When executing Isoform Differential expression rule:

Joining, by = "TranscriptID"
Error: Problem with `mutate()` column `Gene_name`.
ℹ `Gene_name = case_when(...)`.
✖ must be a character vector, not a logical vector.

vmkalbskopf commented 3 years ago

genesList.tsv.txt Here is the Gene2transcripts file.

sanjaynagi / rna-seq-pop

Adapting workflow to other species #38