smith-chem-wisc / Spritz

Software for RNA-Seq analysis to create sample-specific proteoform databases from RNA-Seq data
https://smith-chem-wisc.github.io/Spritz/
MIT License

Generating proteogenomic database for Pseudomonas with VCF called from WGS (or exome seq) data #185

Open animesh opened 4 years ago

animesh commented 4 years ago

I am wondering how I can add an organism like Pseudomonas aeruginosa?

The FASTA file for the reference proteome is available at https://www.uniprot.org/proteomes/UP000002438. Any ideas on how to proceed would be appreciated :)

trishorts commented 4 years ago

You want to run Spritz on Pseudomonas?

acesnik commented 4 years ago

Spritz is currently built to call variants from eukaryotes with RNA-Seq data, so this would take a new workflow.

What type of sequencing data do you have for the sample (e.g. exome, genome)?

Here's the Ensembl genome for Pseudomonas: http://bacteria.ensembl.org/Pseudomonas_aeruginosa_pao1/Info/Index. There's no reference VCF like the one we use for human in GATK.

acesnik commented 4 years ago

We would also need to implement support for other codon tables for this feature: https://github.com/smith-chem-wisc/Spritz/issues/164

animesh commented 4 years ago

I have WGS data for this bacterium, which, based on the assembly, seems to have diverged from the main strain, so using the canonical proteome is clearly suboptimal. I see that a GFF is available at ftp://ftp.ensemblgenomes.org/pub/bacteria/current/gff3/bacteria_67_collection/pseudomonas_aeruginosa/, so perhaps one could use it to call the variants and create a strain-specific VCF?
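For illustration only, here is a rough sketch of one common way to produce a strain-specific VCF from paired-end WGS reads. It assumes bwa, samtools, and bcftools are installed, which nothing in this thread confirms, and the helper name `variant_calling_commands` and all file names are hypothetical. Variant calling itself runs against the reference genome FASTA; the GFF comes in later, when the variants are annotated. The sketch only assembles the command lines so they can be inspected before anything is run.

```python
import shlex


def variant_calling_commands(reference_fasta, reads_1, reads_2, prefix, threads=4):
    """Build shell command lines for a haploid variant-calling run.

    Hypothetical helper: it returns the commands as strings and does not
    execute them. All paths are placeholders supplied by the caller.
    """
    q = shlex.quote
    bam = f"{prefix}.sorted.bam"
    vcf = f"{prefix}.vcf.gz"
    return [
        # Index the reference genome for alignment.
        f"bwa index {q(reference_fasta)}",
        # Align paired-end reads and write a coordinate-sorted BAM.
        f"bwa mem -t {threads} {q(reference_fasta)} {q(reads_1)} {q(reads_2)}"
        f" | samtools sort -o {q(bam)} -",
        # Index the sorted BAM.
        f"samtools index {q(bam)}",
        # Call variants, treating the sample as haploid (bacterial genome).
        f"bcftools mpileup -Ou -f {q(reference_fasta)} {q(bam)}"
        f" | bcftools call -mv --ploidy 1 -Oz -o {q(vcf)}",
    ]


if __name__ == "__main__":
    for line in variant_calling_commands("pao1.fa", "reads_1.fq.gz", "reads_2.fq.gz", "strain"):
        print(line)
```

The printed lines can be pasted into a shell one at a time. The resulting `strain.vcf.gz` would then be the input for the annotation step.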

acesnik commented 4 years ago

This is definitely a good direction to take Spritz. It's also good that the GFF file is available. I know @rmmiller22 was working on vervet monkey samples, which had that situation, i.e. no reference VCF available.

I unfortunately don't have the bandwidth to add this feature to Spritz right now, but we'll keep you posted as we work towards this goal.

By the way, what tool do you typically use to align WGS reads to bacterial genomes? Bowtie/BWA?

acesnik commented 4 years ago

Oh, an option in the meantime: you could generate a VCF file for your sample using other means and run it through the custom SnpEff fork that is part of Spritz, with the options -fastaProt {file} and -xmlProt {file} specified. This should generate FASTA and XML files that can be used in MetaMorpheus or other search software. SnpEff has ~270 different Pseudomonas references, which is a lot. For example, one of them is Pseudomonas_aeruginosa, which you could use for this analysis with

java -Xmx16G -jar snpEff.jar -v -stats {output.html} -fastaProt {output.protfa} -xmlProt {output.protxml} Pseudomonas_aeruginosa {input.vcf} > {output.vcf}

where the bracketed bits are replaced with your desired input/output files.
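If it helps, here is a minimal sketch of wrapping that SnpEff call in a script. The flags and the Pseudomonas_aeruginosa genome name are taken from the command in this comment; the function names, the default jar path, and the default heap size are assumptions made for illustration. As described above, -xmlProt is provided by the SnpEff fork that ships with Spritz, so the jar should point at that build.

```python
import subprocess


def snpeff_command(genome, input_vcf, stats_html, prot_fasta, prot_xml,
                   jar="snpEff.jar", heap="16G"):
    """Assemble the SnpEff argument list (hypothetical helper).

    `jar` and `heap` are illustrative defaults; adjust them to your setup.
    """
    return [
        "java", f"-Xmx{heap}", "-jar", jar,
        "-v",
        "-stats", stats_html,
        "-fastaProt", prot_fasta,   # protein FASTA output
        "-xmlProt", prot_xml,       # protein XML output (Spritz fork option)
        genome,
        input_vcf,
    ]


def run_snpeff(cmd, output_vcf):
    """Run SnpEff and write the annotated VCF from stdout to `output_vcf`."""
    with open(output_vcf, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


if __name__ == "__main__":
    cmd = snpeff_command(
        "Pseudomonas_aeruginosa", "strain.vcf",
        "strain.html", "strain.protein.fasta", "strain.protein.xml",
    )
    print(" ".join(cmd))  # inspect the command before calling run_snpeff(cmd, ...)
```

Passing the arguments as a list avoids shell quoting problems, and `check=True` makes the script stop if SnpEff exits with an error.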