urol-e5 / deep-dive

0 stars 0 forks source link

lncRNA BLAST results and venn diagram #54

Closed zbengt closed 2 months ago

zbengt commented 3 months ago
  1. I went back and checked my merge of lncRNA BLAST results between the three species, I used the "combined_blast.tab" (https://github.com/urol-e5/deep-dive/blob/main/DEF-cross-species/output/08-comparative-BLASTs/combined_blast.tab) to retrieve transcript ID names and merged using those IDs as the first column. Reference lines This resulted in the venn diagram located in the E5 descriptive doc.
# Load necessary libraries
library(dplyr)
library(ggvenn)

# Load the comprehensive query file with all transcript names
query <- read.delim("../../DEF-cross-species/output/08-comparative-BLASTs/combined_blast.tab", sep = '\t', header = FALSE, row.names = NULL)

# Load the BLAST result files
apul <- read.delim("../output/08-comparative-BLASTs/Apul.tab", sep = '\t', header = FALSE, row.names = NULL)
peve <- read.delim("../output/08-comparative-BLASTs/Peve.tab", sep = '\t', header = FALSE, row.names = NULL)
pmea <- read.delim("../output/08-comparative-BLASTs/Pmea.tab", sep = '\t', header = FALSE, row.names = NULL)

# Check the first few rows of each file to understand the structure
print("Query Data:")
head(query)
print("Apul Data:")
head(apul)
print("Peve Data:")
head(peve)
print("Pmea Data:")
head(pmea)
# Rename the identifier column for consistency
colnames(query)[1] <- "ID"
colnames(apul)[1] <- "ID"
colnames(peve)[1] <- "ID"
colnames(pmea)[1] <- "ID"

# Check column names after renaming
print("Column names after renaming:")
print(names(query))
print(names(apul))
print(names(peve))
print(names(pmea))
# Merge using 'query' to ensure all transcripts are included
comp <- query %>%
  left_join(apul, by = "ID", suffix = c("", "_apul")) %>%
  left_join(peve, by = "ID", suffix = c("", "_peve")) %>%
  left_join(pmea, by = "ID", suffix = c("", "_pmea"))

# Check the structure of the merged DataFrame
print("Merged DataFrame structure:")
head(comp)
# Select and rename the relevant columns for analysis
comp <- comp %>%
  select(ID, 
         apul_hit = V2_apul, 
         apul_evalue = V11_apul, 
         peve_hit = V2_peve, 
         peve_evalue = V11_peve, 
         pmea_hit = V2_pmea, 
         pmea_evalue = V11_pmea)

# Check the DataFrame after selecting the relevant columns
print("DataFrame after selecting relevant columns:")
head(comp)
# Categorize based on hits and add an intermediate check
comp <- comp %>%
  rowwise() %>%
  mutate(Hits = sum(!is.na(c_across(c(apul_hit, peve_hit, pmea_hit)))))

# Check the DataFrame after adding Hits column
print("DataFrame after adding Hits column:")
head(comp)
# Categorize the hits based on species presence
comp <- comp %>%
  mutate(Category = case_when(
    !is.na(apul_hit) & is.na(peve_hit) & is.na(pmea_hit) ~ "Only Apul",
    is.na(apul_hit) & !is.na(peve_hit) & is.na(pmea_hit) ~ "Only Peve",
    is.na(apul_hit) & is.na(peve_hit) & !is.na(pmea_hit) ~ "Only Pmea",
    !is.na(apul_hit) & !is.na(peve_hit) & is.na(pmea_hit) ~ "Apul & Peve",
    !is.na(apul_hit) & is.na(peve_hit) & !is.na(pmea_hit) ~ "Apul & Pmea",
    is.na(apul_hit) & !is.na(peve_hit) & !is.na(pmea_hit) ~ "Peve & Pmea",
    !is.na(apul_hit) & !is.na(peve_hit) & !is.na(pmea_hit) ~ "Apul & Peve & Pmea",
    TRUE ~ "Unknown"
  ))

# Check the DataFrame after categorizing
print("DataFrame after categorizing:")
head(comp)
# Prepare data for ggvenn
venn_data <- list(
  "Apul" = comp$ID[!is.na(comp$apul_hit)],
  "Peve" = comp$ID[!is.na(comp$peve_hit)],
  "Pmea" = comp$ID[!is.na(comp$pmea_hit)]
)

# Visualize with ggvenn
ggvenn(venn_data, 
       c("Apul", "Peve", "Pmea"),
       fill_color = c("#408EC6", "#1E2761", "#7A2048"),
       stroke_size = 0.5, 
       set_name_size = 4, 
       text_size = 4)
  1. I also re-ran the "09-homology" code (https://github.com/urol-e5/deep-dive/blob/main/DEF-cross-species/code/09-homology.Rmd) which uses the original merged FASTA to retrieve transcript names.

I got two different totals with approaches 1 & 2, both of which conflict the text totals listed in the 09-homology code. Seems like it's probably an issue with the step identifying hits between species in the BLAST tab file and/or the merge. Probably a simple fix, but would be helpful to run through it together.

sr320 commented 3 months ago

see https://github.com/RobertsLab/resources/issues/1953#issuecomment-2316013875 - lets get into notebook post - with clear goals, steps, and visualization... eg (showing chunk output..... ) describing what "combined_blast.tab" is etc.. .

zbengt commented 3 months ago

Hey Steven please see my notebook post here. This shows how I did each step. FASTAs and BLAST tables are available on Raven and through Deep-Dive github. I'm pretty sure this is the right way to go about this. But I'm concerned I'm doing something incorrectly at the "Join" steps.

sr320 commented 3 months ago

ok I will take a look - minor but you are not "BLAST each species database against full merged FASTA" you blasted the merged fasta against each species separately.

zbengt commented 3 months ago

I'll correct the wording

sr320 commented 3 months ago

when you start joining you need to show the output of the join - using head is fine. and there should be some explanation of the purpose of the code. I think this will bring issue to front. Or betterdraw it out. remember your query has sequences from three species. a hit to Apul and Peve could be because Apul query hit both, or Peve query hit both, or Apul query hit both and Peve hit one...etc etc. ?

zbengt commented 3 months ago

The issue is definitely ggvenn when creating the venn diagram. Both merges look good. intial_merge.csv shows the results of merging BLAST tables. full_interactions.csv shows that the addition of a 'Category' column labeling hits between species was successful. The end result is... Only Apul: 14,521 occurrences Only Pmea: 11,000 occurrences Only Peve: 6,745 occurrences Apul & Pmea: 2,246 occurrences Apul & Peve: 2,193 occurrences Apul & Peve & Pmea: 1,967 occurrences Peve & Pmea: 777 occurrences

Should note these are the original numbers we estimated.

Seems as though these numbers are inflated and don't add up to the total number of sequences per species. Could just be that I'm not getting the math of overlapping transcripts. But I am wondering if that has something to do with the BLAST setting. Since that might mean we are getting more BLAST hits than input sequences per species.

sr320 commented 3 months ago

What are numbers if you divide two hits by 2 and three hits by 3?

On Fri, Aug 30, 2024 at 4:39 PM Steven Roberts @.***> wrote:

Sketch it out—-

I believe it all has to do with what your query is - a combination of all 3 lncRNA sets. There is artificial inflation

You will get Apul and Peve twice for a given match because for a given pair, both are used as a query.

On Fri, Aug 30, 2024 at 4:22 PM Zach Bengtsson @.***> wrote:

The issue is definitely ggvenn when creating the venn diagram. Both merges look good. intial_merge.csv https://urldefense.com/v3/__https://github.com/user-attachments/files/16822669/intial_merge.csv__;!!K-Hz7m0Vt54!hVkF8wHFkDujU6VEmuUaKmyrOLlBilyz61iRAiQv4ZrD9ZCEM90ACqwy8yhY1VLDwO4DGYMyuz0lnhlVVgJeiDI$ shows the results of merging BLAST tables. full_interactions.csv https://urldefense.com/v3/__https://github.com/user-attachments/files/16822668/full_interactions.csv__;!!K-Hz7m0Vt54!hVkF8wHFkDujU6VEmuUaKmyrOLlBilyz61iRAiQv4ZrD9ZCEM90ACqwy8yhY1VLDwO4DGYMyuz0lnhlV6Q5HW4g$ shows that the addition of a 'Category' column labeling hits between species was successful. The end result is... Only Apul: 14,521 occurrences Only Pmea: 11,000 occurrences Only Peve: 6,745 occurrences Apul & Pmea: 2,246 occurrences Apul & Peve: 2,193 occurrences Apul & Peve & Pmea: 1,967 occurrences Peve & Pmea: 777 occurrences

Seems as though these numbers are inflated and don't add up to the total number of sequences per species. Could just be that I'm not getting the math of overlapping transcripts. But I am wondering if that has something to do with the BLAST setting. Since that might mean we are getting more BLAST hits than input sequences per species.

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/urol-e5/deep-dive/issues/54*issuecomment-2322593710__;Iw!!K-Hz7m0Vt54!hVkF8wHFkDujU6VEmuUaKmyrOLlBilyz61iRAiQv4ZrD9ZCEM90ACqwy8yhY1VLDwO4DGYMyuz0lnhlVcA3mxBI$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ABB4PN6FRE3ROBY22W4AG3TZUD5E3AVCNFSM6AAAAABNJLWUG2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRSGU4TGNZRGA__;!!K-Hz7m0Vt54!hVkF8wHFkDujU6VEmuUaKmyrOLlBilyz61iRAiQv4ZrD9ZCEM90ACqwy8yhY1VLDwO4DGYMyuz0lnhlVfSXDoZU$ . You are receiving this because you were assigned.Message ID: @.***>

sr320 commented 3 months ago

Sketch it out—-

I believe it all has to do with what your query is - a combination of all 3 lncRNA sets. There is artificial inflation

You will get Apul and Peve twice for a given match because for a given pair, both are used as a query.

On Fri, Aug 30, 2024 at 4:22 PM Zach Bengtsson @.***> wrote:

The issue is definitely ggvenn when creating the venn diagram. Both merges look good. intial_merge.csv https://urldefense.com/v3/__https://github.com/user-attachments/files/16822669/intial_merge.csv__;!!K-Hz7m0Vt54!hVkF8wHFkDujU6VEmuUaKmyrOLlBilyz61iRAiQv4ZrD9ZCEM90ACqwy8yhY1VLDwO4DGYMyuz0lnhlVVgJeiDI$ shows the results of merging BLAST tables. full_interactions.csv https://urldefense.com/v3/__https://github.com/user-attachments/files/16822668/full_interactions.csv__;!!K-Hz7m0Vt54!hVkF8wHFkDujU6VEmuUaKmyrOLlBilyz61iRAiQv4ZrD9ZCEM90ACqwy8yhY1VLDwO4DGYMyuz0lnhlV6Q5HW4g$ shows that the addition of a 'Category' column labeling hits between species was successful. The end result is... Only Apul: 14,521 occurrences Only Pmea: 11,000 occurrences Only Peve: 6,745 occurrences Apul & Pmea: 2,246 occurrences Apul & Peve: 2,193 occurrences Apul & Peve & Pmea: 1,967 occurrences Peve & Pmea: 777 occurrences

Seems as though these numbers are inflated and don't add up to the total number of sequences per species. Could just be that I'm not getting the math of overlapping transcripts. But I am wondering if that has something to do with the BLAST setting. Since that might mean we are getting more BLAST hits than input sequences per species.

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/urol-e5/deep-dive/issues/54*issuecomment-2322593710__;Iw!!K-Hz7m0Vt54!hVkF8wHFkDujU6VEmuUaKmyrOLlBilyz61iRAiQv4ZrD9ZCEM90ACqwy8yhY1VLDwO4DGYMyuz0lnhlVcA3mxBI$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ABB4PN6FRE3ROBY22W4AG3TZUD5E3AVCNFSM6AAAAABNJLWUG2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRSGU4TGNZRGA__;!!K-Hz7m0Vt54!hVkF8wHFkDujU6VEmuUaKmyrOLlBilyz61iRAiQv4ZrD9ZCEM90ACqwy8yhY1VLDwO4DGYMyuz0lnhlVfSXDoZU$ . You are receiving this because you were assigned.Message ID: @.***>

zbengt commented 3 months ago

Seems about right. I divided by 2 for pairwise and by 3 for all three, then added them up with unique totals and got 35,529. Adding up the totals from the FASTAs is 35,979. So still a little off, but definitely much closer. We also had about 16 transcript with 0 hits.

sr320 commented 3 months ago

How can we get no hits if query and database contain same sequence?

On Fri, Aug 30, 2024 at 4:58 PM Zach Bengtsson @.***> wrote:

Seems about right. I divided by 2 for pairwise and by 3 for all three, then added them up with unique totals and got 35,529. Adding up the totals from the FASTAs is 35,979. So still a little off, but definitely much closer. We also had about 16 transcript with 0 hits.

— Reply to this email directly, view it on GitHub https://urldefense.com/v3/__https://github.com/urol-e5/deep-dive/issues/54*issuecomment-2322610691__;Iw!!K-Hz7m0Vt54!mIJz8FBDogAHTF8FwSzUBBw0NdIW_8KVmLzobzPPFIzGOX6NMKG2O6SqEK4brg4tbXgftr70gtnBx4r4G8FobZg$, or unsubscribe https://urldefense.com/v3/__https://github.com/notifications/unsubscribe-auth/ABB4PNZYSA74YSH56VWIEALZUEBL5AVCNFSM6AAAAABNJLWUG2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMRSGYYTANRZGE__;!!K-Hz7m0Vt54!mIJz8FBDogAHTF8FwSzUBBw0NdIW_8KVmLzobzPPFIzGOX6NMKG2O6SqEK4brg4tbXgftr70gtnBx4r4QNFi8M0$ . You are receiving this because you were assigned.Message ID: @.***>

zbengt commented 3 months ago

Excellent question. I think the only reason unknowns are even recorded in this table is because we used the transcript ID names from the merged FASTA of all three for the first column of the joined BLAST table; otherwise, those transcript IDs would have been excluded from BLAST results since there was no hit. I'm going to see if those are transcript ID insertions within the original FASTAs that don't actually have a sequence. I can do that when I'm back at my computer.

JillAshey commented 2 months ago

@zbengt where do the official lncRNA fastas/gffs/bed files live? There were a couple of iterations in the repo but not sure which one was the correct one for each species. Planning on running the bedtools closest analysis

zbengt commented 2 months ago

@JillAshey here are the FASTAs we've been working with. We may not have actually generated updated bed files, unless you found them. I see you've been working on the bedtools closest analysis so let me know if you still need them! Apul: https://raw.githubusercontent.com/urol-e5/deep-dive/main/D-Apul/output/05.33-lncRNA-discovery/Apul_lncRNA.fasta Peve: https://raw.githubusercontent.com/urol-e5/deep-dive/main/E-Peve/output/Peve_lncRNA.fasta Pmea/tuah: https://raw.githubusercontent.com/urol-e5/deep-dive/main/F-Pmea/output/02-lncRNA-discovery/Pmea_lncRNA.fasta

zbengt commented 2 months ago

LncRNA overlap issues fixed by dividing 2 match totals by 2 and 3 match total by 3. Looks like BLAST also filtered out our low complexity sequences which is why we had some sequences that did not have any hits. Updated venn diagram and wording are in E5 descriptive doc.