Closed by zbengt 2 months ago
See https://github.com/RobertsLab/resources/issues/1953#issuecomment-2316013875. Let's get this into a notebook post with clear goals, steps, and visualization (e.g. showing chunk output), describing what "combined_blast.tab" is, etc.
Hey Steven please see my notebook post here. This shows how I did each step. FASTAs and BLAST tables are available on Raven and through Deep-Dive github. I'm pretty sure this is the right way to go about this. But I'm concerned I'm doing something incorrectly at the "Join" steps.
OK, I will take a look. Minor, but you did not "BLAST each species database against full merged FASTA"; you BLASTed the merged FASTA against each species database separately.
I'll correct the wording
When you start joining, you need to show the output of the join; using head is fine. There should also be some explanation of the purpose of the code. I think this will bring the issue to the front. Or better, draw it out. Remember your query has sequences from three species. A hit to Apul and Peve could be because an Apul query hit both, or a Peve query hit both, or an Apul query hit both and a Peve query hit one, etc.
The issue is definitely ggvenn when creating the Venn diagram. Both merges look good. intial_merge.csv shows the results of merging BLAST tables. full_interactions.csv shows that the addition of a 'Category' column labeling hits between species was successful. The end result is:
- Only Apul: 14,521 occurrences
- Only Pmea: 11,000 occurrences
- Only Peve: 6,745 occurrences
- Apul & Pmea: 2,246 occurrences
- Apul & Peve: 2,193 occurrences
- Apul & Peve & Pmea: 1,967 occurrences
- Peve & Pmea: 777 occurrences
Should note these are the original numbers we estimated.
Seems as though these numbers are inflated and don't add up to the total number of sequences per species. Could just be that I'm not getting the math of overlapping transcripts. But I am wondering if that has something to do with the BLAST setting. Since that might mean we are getting more BLAST hits than input sequences per species.
What are the numbers if you divide the two-species hits by 2 and the three-species hits by 3?
On Fri, Aug 30, 2024 at 4:39 PM Steven Roberts @.***> wrote:
Sketch it out.
I believe it all has to do with what your query is: a combination of all 3 lncRNA sets. There is artificial inflation.
You will get Apul and Peve twice for a given match because, for a given pair, both sequences are used as a query.
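The double counting described above can be sketched with a toy example. The transcript IDs and hit list below are hypothetical and are not taken from the real BLAST tables; the sketch only assumes that each query is categorized by the set of species databases it hit.

```python
from collections import Counter

# Hypothetical toy hits: (query_id, species database that returned a hit).
# Apul_1 and Peve_7 stand in for one homologous pair; each also hits itself.
hits = [
    ("Apul_1", "Apul"), ("Apul_1", "Peve"),
    ("Peve_7", "Peve"), ("Peve_7", "Apul"),
    ("Pmea_3", "Pmea"),
]

def categorize(hits):
    """Count queries per category, where a category is the set of species hit."""
    species_by_query = {}
    for query, species in hits:
        species_by_query.setdefault(query, set()).add(species)
    return Counter(" & ".join(sorted(s)) for s in species_by_query.values())

counts = categorize(hits)
print(counts)
```

Here a single Apul/Peve pair contributes two rows to the "Apul & Peve" category, one for each query, which is why dividing pairwise totals by 2 is a reasonable first correction.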
Seems about right. I divided the pairwise totals by 2 and the three-species total by 3, then added them to the single-species totals and got 35,529. Adding up the totals from the FASTAs gives 35,979. So still a little off, but definitely much closer. We also had about 16 transcripts with 0 hits.
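As a quick check of the arithmetic above, here is a minimal sketch using the category counts reported earlier in this thread. The variable names are made up for illustration.

```python
# Category counts as reported in this thread.
only = {"Apul": 14521, "Pmea": 11000, "Peve": 6745}
pairs = {"Apul & Pmea": 2246, "Apul & Peve": 2193, "Peve & Pmea": 777}
triple = 1967

# Total number of sequences across the three FASTAs, as reported above.
fasta_total = 35979

# Divide pairwise counts by 2 and the three-way count by 3, then sum.
corrected = sum(only.values()) + sum(pairs.values()) / 2 + triple / 3
gap = fasta_total - corrected

print(round(corrected, 2), round(gap, 2))
```

The three-way count is not divisible by 3, so the corrected total is not a whole number. That suggests the division is only an approximate correction, not an exact de-duplication.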
How can we get no hits if query and database contain same sequence?
Excellent question. I think the only reason unknowns are even recorded in this table is because we used the transcript ID names from the merged FASTA of all three for the first column of the joined BLAST table; otherwise, those transcript IDs would have been excluded from BLAST results since there was no hit. I'm going to see if those are transcript ID insertions within the original FASTAs that don't actually have a sequence. I can do that when I'm back at my computer.
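One way to check for transcript IDs without a sequence is sketched below. This is an illustrative helper, not code from the notebook; the function name and the demo records are hypothetical.

```python
import io

def ids_without_sequence(handle):
    """Return FASTA record IDs whose sequence is empty."""
    empty, current, length = [], None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # Close out the previous record before starting a new one.
            if current is not None and length == 0:
                empty.append(current)
            parts = line[1:].split()
            current, length = (parts[0] if parts else ""), 0
        elif current is not None:
            length += len(line)
    # Handle the last record in the file.
    if current is not None and length == 0:
        empty.append(current)
    return empty

# Hypothetical demo input: t2 and t4 have headers but no sequence.
demo = io.StringIO(">t1\nACGT\n>t2\n>t3\nGG\n>t4\n")
print(ids_without_sequence(demo))
```

For the real data, the same function could be called on an opened file, for example `with open("Apul_lncRNA.fasta") as fh: ids_without_sequence(fh)`.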
@zbengt where do the official lncRNA fastas/gffs/bed files live? There were a couple of iterations in the repo but not sure which one was the correct one for each species. Planning on running the bedtools closest analysis
@JillAshey here are the FASTAs we've been working with. We may not have actually generated updated bed files, unless you found them. I see you've been working on the bedtools closest analysis, so let me know if you still need them!
Apul: https://raw.githubusercontent.com/urol-e5/deep-dive/main/D-Apul/output/05.33-lncRNA-discovery/Apul_lncRNA.fasta
Peve: https://raw.githubusercontent.com/urol-e5/deep-dive/main/E-Peve/output/Peve_lncRNA.fasta
Pmea/tuah: https://raw.githubusercontent.com/urol-e5/deep-dive/main/F-Pmea/output/02-lncRNA-discovery/Pmea_lncRNA.fasta
lncRNA overlap issues fixed by dividing the two-species match totals by 2 and the three-species match total by 3. It looks like BLAST also filtered out our low-complexity sequences, which is why some sequences did not have any hits. The updated Venn diagram and wording are in the E5 descriptive doc.
I got two different totals with approaches 1 and 2, both of which conflict with the text totals listed in the 09-homology code. It seems like it's probably an issue with the step identifying hits between species in the BLAST tab file and/or with the merge. Probably a simple fix, but it would be helpful to run through it together.