Adapting workflow to other species

vmkalbskopf commented 3 years ago

This looks like a comprehensive workflow. I'd like to use it to do DE analysis and variant calling for other species, like non-human malaria. What would I need to do to make that happen?

sanjaynagi commented 3 years ago

The transcriptIDs should match with that found in abundance.tsv - Kallisto should be mapping to the reference transcriptome fasta file, and so the 'target_id' from Kallisto should match with the fasta headers.

vmkalbskopf commented 3 years ago

It was reporting the exon IDs before, but now it's reporting the identical ID in the transcript file. But I am getting the same error as before.

Gene diff expr.

 ------------- Kallisto - DESeq2 - RNASeq Differential expression ---------
Joining, by = "TranscriptID"
Error in `.rowNamesDF<-`(x, value = value) :
  missing values in 'row.names' are not allowed
Calls: %>% ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
Execution halted

Isoform DE:

Joining, by = "TranscriptID"
Error: Problem with `mutate()` column `Gene_name`.
ℹ `Gene_name = case_when(...)`.
✖ must be a character vector, not a logical vector.

sanjaynagi commented 3 years ago

difficult to figure out whats going on, without being able to run the script. Do you have any missing rows or duplicates in the Transcript ID column? I would try and run the script in rstudio and see where it falters, just edit the snakemake@input bits to be the paths to your files

vmkalbskopf commented 3 years ago

How do I edit the snakemake@input bits' to be paths to my files? I'm quite lost.

I see the lines

metadata = fread(snakemake@input[['metadata']], sep="\t") %>% as.data.frame()
gene_names = fread(snakemake@input[['genes2transcripts']], sep="\t") %>% distinct()

contrastsdf = data.frame("contrast" = snakemake@params[['DEcontrasts']])

but I don't know how to change them to refer to the file paths instead. metadata and genes2transcripts refers to inputs in the rules.

I have a list of input files generated when the rule fails:

input: config/samples.tsv, resources/Gene2TranscriptMap.tsv, results/quant/R47, results/quant/R48, results/quant/R49, results/quant/R50, results/quant/R51, results/quant/R52, results/quant/B7, results/quant/B8, results/quant/B9

EDIT!

I'm figuring it out...

sanjaynagi commented 3 years ago

Hey Victor. What i mean, is to open the Rscript outside of the snakemake pipeline in Rstudio. You may want to make a copy of the script of or something.

then change

metadata = fread(snakemake@input[['metadata']], sep="\t") %>% as.data.frame()

to

metadata = fread("config/samples.tsv", sep="\t") %>% as.data.frame() contrastsdf = data.frame("contrast" = list("cond1_cond2"))

and do similar for gene_names (genes2transcript file). Then you should be able to run the Rscript step by step (you may need to install the packages), and it will be clearer where the script is failing.

vmkalbskopf commented 3 years ago

Yup! You're too nice Sanjay, and I'm too impatient. I struggled a bit after asking for help, and figured it.

I've gotten to the same point in the script where it's failing when run with sn.

I'll now try figure out why it's saying there are missing row names.

vmkalbskopf commented 3 years ago

>sum(is.na(counts$GeneID))
[1] 1

> sum(is.na(gene_names$GeneID))
[1] 0

The very last row of counts has a missing GeneID.


5177 | PRELSG_MIT00300 | 2108.0000 | 1961 | 1186.00000 | 3249.000 | 1.24500e+03 | 1341.00000

5178 | NA | 53001.0000 | 77939 | 32034.00000 | 46699.000 | 1.09758e+05 | 38210.00000

I hate off-by-one errors. I'm trying to figure out if there there's a line in my Gene2TranscriptMap.tsv that's missing a GeneID.

EDIT I can't find the missing gene ID. I visually inspected the whole file twice.

vmkalbskopf commented 3 years ago

I have 5312 entries in the abundance.tsv file. None of the transcript ID's repeat, but I have 5178 entries in the Gene2Transcript file. All of the genome resources are the same version. I'm lost now.

sanjaynagi commented 3 years ago

I've just stripped them from the PrelictumSG1-like transcriptome.fasta. it has 5312 entries now, but you'll have to do some sort of join if you want gene names and descriptions.

i did the following....

grep -e ">" PlasmoDB-52_PrelictumSGS1-like_AnnotatedTranscripts.fasta | cut -c 2-50 > prelictum_headers.fa then in r...

library(data.table)
library(tidyverse)
names = fread("~/prelictum_headers.fa", header = FALSE)[,1:2]
colnames(names) = c("TranscriptID", "GeneID")
names$GeneID = str_remove(names$GeneID, "gene=")
names$GeneName = ""
fwrite(names, "~/Prelictum_Genes2TranscriptMap.tsv", sep="\t", col.names = TRUE)

Prelictum_Genes2TranscriptMap.tsv.txt

vmkalbskopf commented 3 years ago

Huh.. It works now. Thanks.

I built my Genes2Transcripts file from the gff file. And that has 5178 mRNA entries. This is not relevant to your project, but I wonder why the gff file has fewer transcripts than the Annotated Transcripts fasta file.

vmkalbskopf commented 3 years ago

I've had a frustrating few days trying to figure out why the code running on the local R session worked, but not when run part of the sm workflow. Well, turns out it actually wasn't running all the way through locally. Here is the error while running the DESeq2 function:

 --- Running DESeq2 differential expression analysis on cond1_cond2 --- 
Joining, by = "GeneID"
Joining, by = "GeneID"
Error: Problem with `mutate()` input `Gene_name`.
x must be a character vector, not a logical vector.
ℹ Input `Gene_name` is `case_when(...)`.

I get this identical error locally and in the SM workflow. I don't have canonical gene names. but not sure if that is what the error is referring too. It's just the GeneID, right?

vmkalbskopf commented 3 years ago

I think I found the problem.

 labels = results %>% dplyr::mutate("Gene_name" = case_when(GeneName == "" ~ GeneID,
                                     is.na(GeneName) ~ GeneID,
                                     TRUE ~ GeneID)) %>% select(Gene_name) %>% deframe()

I changed TRUE ~ Gene_name to TRUE ~ GeneID and it seems to work now.

I was getting nearly the identical errors for the Isoform difexpr rule. So I made the same change, swapping out GeneID for TranscriptID in the same function.

I'm not sure if I broke it though, since I don't understand the function.

vmkalbskopf commented 3 years ago

Next up! ExtractBedVCF throws this error

Traceback (most recent call last):
  File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmprgz_st1m.ExtractBedVCF.py", line 21, in <module>
    pos = vcf['variants/POS']
TypeError: 'NoneType' object is not subscriptable

I tried

print(vcf)
pos = vcf['variants/POS']

And it printed a long a log vcf file to the log file, so I know it exists.

The last chromosome it reads is annot.missense.PRELSG_API_v1.vcf. It only has one SNP according to grep -v '#' annot.missense.PRELSG_API_v1.vcf | wc -l

Here is that vcf file. annot.missense.PRELSG_API_v1.vcf.txt

vmkalbskopf commented 3 years ago

Here's 2 more unrelated errors

When running PerGeneFstPBS.py

  File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmpin7thiea.PerGeneFstPBS.py", line 143, in <module>
    ndf.columns = ['GeneID', 'nSNPs']
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/core/generic.py", line 5475, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/core/generic.py", line 669, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/scratch/victor/rna-seq-ir/.snakemake/conda/bde04192122580786f8a35cfcbfe3930/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 221, in set_axis
    f"Length mismatch: Expected axis has {old_len} elements, new "
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements

and I get this when running StatisticsAndPCA.py

Traceback (most recent call last):
  File "/scratch/victor/rna-seq-ir/.snakemake/scripts/tmpjvl4kifb.StatisticsAndPCA.py", line 113, in <module>
    pca(geno, chrom, ploidy, dataset, populations, metadata, pop_colours, prune=True, scaler=None)
  File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 253, in pca
    fig_pca(coords1, model1, f"PCA {chrom} {dataset}", f"results/variants/plots/PCA-{chrom}-{dataset}", samples, pop_colours, sample_population=populations)
  File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 234, in fig_pca
    plot_pca_coords(coords, model, 0, 1, ax, sample_population, samples, pop_colours)
  File "/scratch/victor/rna-seq-ir/workflow/scripts/tools.py", line 215, in plot_pca_coords
    y = coords[:, pc2]
IndexError: index 1 is out of bounds for axis 1 with size 1

sanjaynagi commented 3 years ago

those latest errors must be to do with the lack of SNPs on the apicoplast and possibly mitochondrial chromosome - I would remove this from the analysis personally.

vmkalbskopf commented 3 years ago

Interesting! We're making progress! Removing those chromosomes worked. Perhaps the workflow would be less fragile if a lack of SNPs wouldn't stall it.

Anyway, here is the next error from AlleleTables:

[mpileup] 1 samples in 1 input files
r/rna-seq-ir/workflow/scripts/mpileup2readcounts/mpileup2readcounts: No such file or directory

The /rna-seq-ir/workflow/scripts/mpileup2readcounts/ is indeed empty. I don't know why. It should have been cloned with everything else. I see that it is part of another repo.

I don't know why there is a leading r in the error though r/rna-seq-ir/workfl.... Probably just a printing artifact in samtools.

vmkalbskopf commented 3 years ago

I downloaded mpileup2readcounts.cc directly from the remote repo into the correct directory and compiled it according to the readme. And it works!

I'm still getting the same PerGeneFstPBS error as before. Removing the 2 small chromosomes did not fix this one.

sanjaynagi / rna-seq-pop

Adapting workflow to other species #38