Closed vmkalbskopf closed 2 years ago
The transcriptIDs should match with that found in abundance.tsv - Kallisto should be mapping to the reference transcriptome fasta file, and so the 'target_id' from Kallisto should match with the fasta headers.
It was reporting the exon IDs before, but now it's reporting the identical ID in the transcript file. But I am getting the same error as before.
Gene diff expr.
------------- Kallisto - DESeq2 - RNASeq Differential expression ---------
Joining, by = "TranscriptID"
Error in `.rowNamesDF<-`(x, value = value) :
missing values in 'row.names' are not allowed
Calls: %>% ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
Execution halted
Isoform DE:
Joining, by = "TranscriptID"
Error: Problem with `mutate()` column `Gene_name`.
ℹ `Gene_name = case_when(...)`.
✖ must be a character vector, not a logical vector.
difficult to figure out whats going on, without being able to run the script. Do you have any missing rows or duplicates in the Transcript ID column? I would try and run the script in rstudio and see where it falters, just edit the snakemake@input bits to be the paths to your files
How do I edit the snakemake@input bits' to be paths to my files? I'm quite lost.
I see the lines
metadata = fread(snakemake@input[['metadata']], sep="\t") %>% as.data.frame()
gene_names = fread(snakemake@input[['genes2transcripts']], sep="\t") %>% distinct()
contrastsdf = data.frame("contrast" = snakemake@params[['DEcontrasts']])
but I don't know how to change them to refer to the file paths instead.
metadata
and genes2transcripts
refers to inputs in the rules.
I have a list of input files generated when the rule fails:
input: config/samples.tsv, resources/Gene2TranscriptMap.tsv, results/quant/R47, results/quant/R48, results/quant/R49, results/quant/R50, results/quant/R51, results/quant/R52, results/quant/B7, results/quant/B8, results/quant/B9
EDIT!
I'm figuring it out...
Hey Victor. What i mean, is to open the Rscript outside of the snakemake pipeline in Rstudio. You may want to make a copy of the script of or something.
then change
metadata = fread(snakemake@input[['metadata']], sep="\t") %>% as.data.frame()
to
metadata = fread("config/samples.tsv", sep="\t") %>% as.data.frame()
contrastsdf = data.frame("contrast" = list("cond1_cond2"))
and do similar for gene_names (genes2transcript file). Then you should be able to run the Rscript step by step (you may need to install the packages), and it will be clearer where the script is failing.
Yup! You're too nice Sanjay, and I'm too impatient. I struggled a bit after asking for help, and figured it.
I've gotten to the same point in the script where it's failing when run with sn.
I'll now try figure out why it's saying there are missing row names.
>sum(is.na(counts$GeneID))
[1] 1
> sum(is.na(gene_names$GeneID))
[1] 0
The very last row of counts
has a missing GeneID.
5177 | PRELSG_MIT00300 | 2108.0000 | 1961 | 1186.00000 | 3249.000 | 1.24500e+03 | 1341.00000
5178 | NA | 53001.0000 | 77939 | 32034.00000 | 46699.000 | 1.09758e+05 | 38210.00000
I hate off-by-one errors. I'm trying to figure out if there there's a line in my Gene2TranscriptMap.tsv that's missing a GeneID.
EDIT I can't find the missing gene ID. I visually inspected the whole file twice.
I have 5312 entries in the abundance.tsv file. None of the transcript ID's repeat, but I have 5178 entries in the Gene2Transcript file. All of the genome resources are the same version. I'm lost now.
I've just stripped them from the PrelictumSG1-like transcriptome.fasta. it has 5312 entries now, but you'll have to do some sort of join if you want gene names and descriptions.
i did the following....
grep -e ">" PlasmoDB-52_PrelictumSGS1-like_AnnotatedTranscripts.fasta | cut -c 2-50 > prelictum_headers.fa
then in r...
library(data.table)
library(tidyverse)
names = fread("~/prelictum_headers.fa", header = FALSE)[,1:2]
colnames(names) = c("TranscriptID", "GeneID")
names$GeneID = str_remove(names$GeneID, "gene=")
names$GeneName = ""
fwrite(names, "~/Prelictum_Genes2TranscriptMap.tsv", sep="\t", col.names = TRUE)
Huh.. It works now. Thanks.
I built my Genes2Transcripts file from the gff file. And that has 5178 mRNA entries. This is not relevant to your project, but I wonder why the gff file has fewer transcripts than the Annotated Transcripts fasta file.
I've had a frustrating few days trying to figure out why the code running on the local R session worked, but not when run part of the sm workflow. Well, turns out it actually wasn't running all the way through locally. Here is the error while running the DESeq2 function:
--- Running DESeq2 differential expression analysis on cond1_cond2 ---
Joining, by = "GeneID"
Joining, by = "GeneID"
Error: Problem with `mutate()` input `Gene_name`.
x must be a character vector, not a logical vector.
ℹ Input `Gene_name` is `case_when(...)`.
I get this identical error locally and in the SM workflow. I don't have canonical gene names. but not sure if that is what the error is referring too. It's just the GeneID, right?
I think I found the problem.
labels = results %>% dplyr::mutate("Gene_name" = case_when(GeneName == "" ~ GeneID,
is.na(GeneName) ~ GeneID,
TRUE ~ GeneID)) %>% select(Gene_name) %>% deframe()
I changed TRUE ~ Gene_name
to TRUE ~ GeneID
and it seems to work now.
I was getting nearly the identical errors for the Isoform difexpr rule.
So I made the same change, swapping out GeneID
for TranscriptID
in the same function.
I'm not sure if I broke it though, since I don't understand the function.
Next up! ExtractBedVCF throws this error
Traceback (most recent call last):
File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmprgz_st1m.ExtractBedVCF.py", line 21, in <module>
pos = vcf['variants/POS']
TypeError: 'NoneType' object is not subscriptable
I tried
print(vcf)
pos = vcf['variants/POS']
And it printed a long a log vcf file to the log file, so I know it exists.
The last chromosome it reads is annot.missense.PRELSG_API_v1.vcf.
It only has one SNP according to grep -v '#' annot.missense.PRELSG_API_v1.vcf | wc -l
Here is that vcf file. annot.missense.PRELSG_API_v1.vcf.txt
Here's 2 more unrelated errors
When running PerGeneFstPBS.py
File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmpin7thiea.PerGeneFstPBS.py", line 143, in <module>
ndf.columns = ['GeneID', 'nSNPs']
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/core/generic.py", line 5475, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/_libs/properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/core/generic.py", line 669, in _set_axis
self._mgr.set_axis(axis, labels)
File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 221, in set_axis
f"Length mismatch: Expected axis has {old_len} elements, new "
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements
and I get this when running StatisticsAndPCA.py
Traceback (most recent call last):
File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmpjvl4kifb.StatisticsAndPCA.py", line 113, in <module>
pca(geno, chrom, ploidy, dataset, populations, metadata, pop_colours, prune=True, scaler=None)
File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 253, in pca
fig_pca(coords1, model1, f"PCA {chrom} {dataset}", f"results/variants/plots/PCA-{chrom}-{dataset}", samples, pop_colours, sample_population=populations)
File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 234, in fig_pca
plot_pca_coords(coords, model, 0, 1, ax, sample_population, samples, pop_colours)
File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 215, in plot_pca_coords
y = coords[:, pc2]
IndexError: index 1 is out of bounds for axis 1 with size 1
those latest errors must be to do with the lack of SNPs on the apicoplast and possibly mitochondrial chromosome - I would remove this from the analysis personally.
Interesting! We're making progress! Removing those chromosomes worked. Perhaps the workflow would be less fragile if a lack of SNPs wouldn't stall it.
Anyway, here is the next error from AlleleTables
:
[mpileup] 1 samples in 1 input files
r/rna-seq-ir/workflow/scripts/mpileup2readcounts/mpileup2readcounts: No such file or directory
The /rna-seq-ir/workflow/scripts/mpileup2readcounts/
is indeed empty.
I don't know why. It should have been cloned with everything else.
I see that it is part of another repo.
I don't know why there is a leading r
in the error though r/rna-seq-ir/workfl...
. Probably just a printing artifact in samtools.
I downloaded mpileup2readcounts.cc directly from the remote repo into the correct directory and compiled it according to the readme. And it works!
I'm still getting the same PerGeneFstPBS error as before. Removing the 2 small chromosomes did not fix this one.
This looks like a comprehensive workflow. I'd like to use it to do DE analysis and variant calling for other species, like non-human malaria. What would I need to do to make that happen?