maggimars / Tara-Phaeo


Map Tara data to gene/protein clusters #3

Open maggimars opened 3 years ago

maggimars commented 3 years ago
halexand commented 3 years ago

Perhaps start with the mapping within the North Atlantic dataset? It is one of the smaller of the available large ocean regions and has some good high-ish latitude / coastal coverage.

I think I am in favor of creating an index of all the concatenated transcripts. We can then tease apart the mappings based on clustering results.

maggimars commented 3 years ago

Just double-checking: for this part I should only use the transcriptomes, right? And leave out the two genome nucleotide files?

halexand commented 3 years ago

You can include the predicted coding regions from the genomes but you wouldn't want to include the full genome sequence. So filtered transcripts from the genomes, does that make sense?

maggimars commented 3 years ago

Just a note, in order to get an environment with snakemake on poseidon, I had to do the following:

  1. make a new environment: conda create -c conda-forge -c bioconda -n snakemake
  2. activate that environment: conda activate snakemake
  3. install mamba: conda install -c conda-forge mamba
  4. use mamba to install snakemake: mamba install snakemake -c conda-forge -c bioconda

Trying to directly install snakemake into a new environment with conda create -n snakemake -c bioconda -c conda-forge snakemake left the environment trying to solve ... forever. I also was not allowed to install mamba outside an environment as is suggested on the snakemake readthedocs page. Thought this might be useful to someone else in the group at some point.

maggimars commented 3 years ago

I concatenated all the references together but did not try dereplicating yet -- partly because I haven't thought of the best way to do it yet and partly because I wanted to see how many of the Tara sequences mapped to the Phaeo references before I started removing sequences.

Regarding dereplicating:

Exploration of initial mapping results: I mapped all the trimmed metaG and metaT sequences from North Atlantic surface water samples (5µm-20µm size-fraction) to the concatenated reference. I then took a quick look at how many reads from each sample were mapping to the individual references in relation to the total number of quality-filtered trimmed reads.

The metaG mapping results were a little funky. A pretty low proportion of reads from each sample mapped to the reference (less than 1% for most samples), and the majority of mapped reads were assigned to the JGI P. antarctica genome. The JGI P. globosa genome received the second largest proportion of reads. This made me think that there might be a problem with mapping the metaG reads to these transcriptomes ... ?

The metaT mapping results were more as I had expected them to be. About 5-10% of reads in most samples mapped to the reference and the vast majority of mapped reads were to the P. globosa ccmp1528 transcriptome.
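A rough sketch of how the per-reference breakdown described above could be tallied from a Salmon `quant.sf` file. It assumes (hypothetically, this is not stated in the thread) that each transcript name in the concatenated reference carries a source prefix such as `refA|transcript1`; the separator and the function names are illustrative only.

```python
from collections import defaultdict


def reads_per_reference(quant_lines, sep="|"):
    """Sum Salmon NumReads per source reference.

    Assumption: transcript names look like '<source><sep><id>'.
    `quant_lines` is the content of a quant.sf file as a list of lines.
    """
    header = quant_lines[0].rstrip("\n").split("\t")
    name_col = header.index("Name")
    reads_col = header.index("NumReads")
    totals = defaultdict(float)
    for line in quant_lines[1:]:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(name_col, reads_col):
            continue  # skip blank or truncated lines
        source = fields[name_col].split(sep, 1)[0]
        totals[source] += float(fields[reads_col])
    return dict(totals)


def percent_of_total(totals, n_trimmed_reads):
    """Express per-reference read counts as a percent of all trimmed reads."""
    return {ref: 100.0 * n / n_trimmed_reads for ref, n in totals.items()}
```

Dividing by the number of quality-filtered trimmed reads (rather than by mapped reads) keeps the overall mapping rate and the per-reference split visible in the same number.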

maps

halexand commented 3 years ago

Super exciting to see some initial data from the mapping.

Re: de-replicating:

  1. I agree with your first point-- I don't think that de-rep prior to mapping is necessary. I think parsing it on the back end makes sense and shouldn't impact the results significantly. I think in the long run it might make sense to see how things look if you try to parse the results based on your figure in #2 -- gene clustering.
  2. For single-copy orthologs are you thinking of using BUSCO? Yeah, BUSCO recovery in transcriptomes seems a bit extra spotty... I think you are correct that they aren't as well expressed. Or rather that they aren't uniformly expressed.

Mapping results:

Super interesting to see the side-by-side comparison of metaT/metaG.

I think that the magnitude of mapped reads in both the metaT and metaG is as I would expect. I am not surprised by the difference in species abundance between the metaT and metaG. There is quite a bit of bacterial abundance in the metaG as well as a lot of non-coding material that is going to be missed in our mapping (as we only have the predicted coding regions). Thus lower overall mapping rates.

Re: strange species abundance in the metaG. I wonder if we are seeing a signature of protein-coding regions that are "core" to all strains but absent from the transcriptomes being captured in the genomes. So, if a protein that is common (though slightly different) across strains is not assembled in the transcriptomes but is assembled in a genome, then reads from it map artificially to that genome. Thus, P. antarctica being artificially high in the metaG.

Perhaps expressed transcripts in the transcriptomes are also likely to be the expressed transcripts in the metaT. Thus, the missing genes in the transcriptomes that are driving the metaG abundance of P. antarctica are less important?

maggimars commented 3 years ago

mappingresults_all

maggimars commented 3 years ago

Some follow-up on deduplicating: While looking through the Salmon results files, I read the log from building the Salmon index and realized that Salmon automatically deduplicates references when creating the index. In this case, it removed 862 duplicate sequences. At first, I thought this might be contributing to the strange P. antarctica patterns. I looked at the record of which duplicated sequences were retained and discarded: the vast majority were in the P. antarctica JGI genome, but both the retained and the discarded sequences came from that same genome. That is, it doesn't seem like sequences from the other references were discarded en masse because they were duplicates of sequences in the P. antarctica JGI genome. However, there is an option to have the index retain all duplicates (--keepDuplicates when building the index), and I am thinking about redoing the indexing and mapping with this flag...
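A minimal sketch of the retained-versus-discarded check described above. It assumes the index directory holds a two-column, tab-separated record of retained and discarded sequence names (I believe Salmon writes this as `duplicate_clusters.tsv` with a `RetainedRef` / `DuplicateRef` header, but treat the file name and header as assumptions), and, hypothetically, that sequence names carry a source prefix such as `refA|seq1`.

```python
from collections import Counter


def source_of(name, sep="|"):
    """Return the source-reference prefix of a sequence name (assumed format)."""
    return name.split(sep, 1)[0]


def tally_duplicates(lines, sep="|"):
    """Count (retained source, discarded source) pairs.

    `lines` is the duplicate record as a list of tab-separated lines;
    a header row starting with 'RetainedRef' is skipped if present.
    """
    pairs = Counter()
    for i, line in enumerate(lines):
        fields = line.rstrip("\n").split("\t")
        if i == 0 and fields[0] == "RetainedRef":
            continue  # header row
        if len(fields) < 2:
            continue  # blank or malformed line
        pairs[(source_of(fields[0], sep), source_of(fields[1], sep))] += 1
    return pairs
```

If nearly all counts fall on pairs where both sources are the same reference, the deduplication is mostly removing within-reference copies rather than dropping sequences from other references.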

maggimars commented 3 years ago

annots_percents

maggimars commented 3 years ago

allfracheat

maggimars commented 3 years ago

allsamples_heat

maggimars commented 3 years ago

The new heat map shows aggregated logTPM for each GO term in each sample. Samples cluster first by size fraction and then by sample type (T or G). It's definitely too much to take in at once...
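One plausible reading of "aggregated logTPM", sketched for a single sample: sum TPM over all transcripts annotated with a GO term, then take log10 with a pseudocount. The comment above does not say which order, base, or pseudocount was used, so all three are assumptions, as are the function and argument names.

```python
import math
from collections import defaultdict


def aggregate_log_tpm(tpm_by_transcript, go_by_transcript, pseudocount=1.0):
    """Sum TPM per GO term for one sample, then log10-transform.

    tpm_by_transcript: {transcript_id: TPM}
    go_by_transcript:  {transcript_id: iterable of GO term IDs}
    Transcripts without an annotation are ignored.
    """
    totals = defaultdict(float)
    for transcript, tpm in tpm_by_transcript.items():
        for term in go_by_transcript.get(transcript, ()):
            totals[term] += tpm
    return {term: math.log10(total + pseudocount) for term, total in totals.items()}
```

A transcript with several GO terms contributes its full TPM to each of them here, so the per-term values do not sum to the sample total.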

maggimars commented 3 years ago

mappingrateVannotation.pdf

The relationship between mapping rate and annotation rate makes some sense for the metaG, but it is kind of all over the place for the metaT ...

halexand commented 3 years ago

... PCA time?

maggimars commented 3 years ago

The whole world (except Asia and Australia): (at least the patterns we saw in the N. Atlantic are holding for the whole data set - although still puzzling)

percentstrains_allregions

halexand commented 3 years ago

I spoke with Sonya about the weirdness we see. One other question: what would it look like if we included something more distantly related (e.g. Ehux? or Chrysochromulina? etc.)? Do we still see discrepancies?

How does this relate to mapping from the metaT assemblies from Tara?

maggimars commented 3 years ago

Add this to the puzzle:

I ran OrthoFinder on all the Phaeo protein sequences (except the super small RCC P. cordata transcriptome). I then extracted the single-copy core gene (SCG) sequences from the nucleotide .fasta files and mapped the Tara data to just the SCGs. The mapping rates were understandably quite low since the reads were mapped to only 169 genes. The metaT small size fraction looks the most biologically realistic, but now P. jahnii is really popping out in strange ways.
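For anyone repeating the extraction step, here is a minimal sketch of pulling a set of sequences out of a nucleotide FASTA by ID. It assumes the IDs of the single-copy genes (taken from the ortholog results) match the first whitespace-delimited token of the nucleotide FASTA headers, which may not hold for every reference; the function names are illustrative.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_by_id(fasta_lines, wanted_ids):
    """Keep records whose ID (first token of the header) is in wanted_ids."""
    wanted = set(wanted_ids)
    return [(h, s) for h, s in read_fasta(fasta_lines) if h.split()[0] in wanted]
```

The extracted records can then be written back out as a FASTA file and indexed on their own.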

SCG_allregions

halexand commented 3 years ago

Interesting indeed. It still looks like the metaT is better capturing the species breakdowns? The metaG P. jahnii abundance is quite striking and is very different from the other plots you have created. I need to think more on this.