laurahspencer / DuMOAR


Decide on version of genome to use #23

Closed sr320 closed 1 year ago

sr320 commented 1 year ago

Seems like a few options for genome.

Full (scaffolds and contigs) - filename: PGA_assembly.fasta

Full - filtered (scaffolds and contigs): PGA_assembly.fasta.filtered.renamed.fa?

Scaffolds

Scaffolds filtered


What does filtered mean?


My vote is for scaffolds only

kristamnichols commented 1 year ago

Filtered == removed contigs that were flagged by Phase as being questionable, but the filtered file DOES NOT include only the 49. If you look at the report.txt in the GDrive you can see the stats for the 49 scaffolds created by HiC. Should be straightforward to subset the genome to only the _scaffold\d+ sequences in the file. I'm stacked with meetings, but maybe @ggoetznoaa can do that if you're not already on it, @sr320?
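The subsetting step described above could be sketched roughly as follows. This is a minimal illustration, not the command that was actually run: the `_scaffold\d+` pattern comes from the comment, while the `demo_*` sequence names and the helper names `read_fasta` and `keep_scaffolds` are hypothetical.

```python
import re

# Pattern quoted in the comment above; the real header layout is not shown in this thread.
SCAFFOLD_RE = re.compile(r"_scaffold\d+")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_scaffolds(records):
    """Keep only records whose header matches the scaffold pattern."""
    return [(name, seq) for name, seq in records if SCAFFOLD_RE.search(name)]


# Hypothetical toy input; real headers in PGA_assembly.fasta may look different.
demo = """>demo_scaffold0
ACGT
ACGT
>demo_contig17
TTTT
>demo_scaffold48
GGCC
""".splitlines()

kept = keep_scaffolds(read_fasta(demo))
for name, seq in kept:
    print(f">{name}\n{seq}")
```

For the real file, the lines of an opened `PGA_assembly.fasta` handle could be passed to `read_fasta` in place of `demo`.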

kristamnichols commented 1 year ago

My prior foray into identification of the mtDNA in the 49 indicated that the two smallest scaffolds could be part of the mtDNA, but I wanted to more fully validate that, which I hadn't gotten to yet.

ggoetznoaa commented 1 year ago

PGA_assembly.fasta.filtered.renamed.fa was created so we could upload the file to NCBI. I had to rename the sequences in the fasta file because their names were too long, and 17 sequences that were below NCBI's length threshold of 200 bp had to be filtered out. To summarize: PGA_assembly.fasta is the original assembly file, PGA_assembly.fasta.filtered.fa is the fasta file after I removed the short contigs, and PGA_assembly.fasta.filtered.renamed.fa is the filtered file with the sequence names changed.
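The two steps described above (drop sequences under 200 bp, then shorten the names) might look something like this. It is only a sketch: the 200 bp threshold comes from the comment, but the `seq_N` naming scheme and the function names are hypothetical, since the actual renaming convention is not given in this thread.

```python
MIN_LENGTH = 200  # NCBI length threshold mentioned in the comment above


def drop_short(records, min_length=MIN_LENGTH):
    """Keep (name, sequence) records at least min_length bases long."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


def rename(records, prefix="seq"):
    """Assign short sequential names (hypothetical scheme) and return an old-to-new map."""
    renamed, mapping = [], {}
    for index, (old_name, seq) in enumerate(records):
        new_name = f"{prefix}_{index}"
        mapping[old_name] = new_name
        renamed.append((new_name, seq))
    return renamed, mapping


# Toy records with made-up names; lengths chosen to straddle the threshold.
records = [
    ("a_very_long_original_name_1", "A" * 199),
    ("a_very_long_original_name_2", "C" * 200),
    ("a_very_long_original_name_3", "G" * 500),
]

filtered = drop_short(records)
renamed, name_map = rename(filtered)
print(len(records) - len(filtered), "sequence(s) removed")
print(name_map)
```

Keeping the old-to-new name map makes it possible to trace results on the renamed file back to the original assembly.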

I tried to find an email with the back and forth between Krista and me about this, but I think we did it over Google Chat.

kristamnichols commented 1 year ago

Thanks for documenting this here @ggoetznoaa

sr320 commented 1 year ago

My suggestion - interested in feedback - is that we use the version of the genome that only includes scaffolds.

kristamnichols commented 1 year ago

I agree that we should just use the scaffolds for alignments, annotation, etc. Does not having the mtDNA therein present issues for evaluating MBD-seq quality?

kristamnichols commented 1 year ago

@ggoetznoaa can you please run BUSCO on the scaffolds only and report those stats here? The BUSCO stats for the whole assembly (scaffolds + contigs), using the arthropoda_odb10 BUSCO database, showed 96.1% completeness when you ran it some time ago.

ggoetznoaa commented 1 year ago

@kristamnichols just so it's clear, the scaffolds are the sequences (49 of them) with the name scaffold in them (0 thru 48)?

kristamnichols commented 1 year ago

Yes -- 49 scaffolds numbered 0 to 48 -- thanks! **CORRECTED typo**

ggoetznoaa commented 1 year ago

40? I'm counting 49.

ggoetznoaa commented 1 year ago

BUSCO Finished, here are the results

C:88.2%[S:86.9%,D:1.3%],F:3.4%,M:8.4%,n:1013       
893 Complete BUSCOs (C)            
880 Complete and single-copy BUSCOs (S)    
13  Complete and duplicated BUSCOs (D)     
34  Fragmented BUSCOs (F)              
86  Missing BUSCOs (M)             
1013    Total BUSCO groups searched

And this was with the arthropoda_odb10 database.

kristamnichols commented 1 year ago

What does the group think? Limit analysis to the 49 scaffolds, use ALL the data, or use scaffolds / contigs over a certain size? If the latter, what size would be logical, and we can run BUSCO on that as well.

ggoetznoaa commented 1 year ago

To give you an idea of the breakdown of sequence lengths, here are the summary stats of the lengths of ALL of the sequences from the original fasta file.

Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  22     1171     4692   263709    30859 42616187 
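A summary like the one above can be reproduced from a list of sequence lengths along these lines. The layout resembles R's `summary()` output, but that is an assumption; the sketch below uses Python's `statistics.quantiles` with `method="inclusive"`, which interpolates the same way as R's default quantile type. The function name and the toy lengths are hypothetical.

```python
import statistics


def length_summary(lengths):
    """Return min, quartiles, mean, and max for a list of sequence lengths."""
    data = sorted(lengths)
    # "inclusive" treats the data as the full population, matching R's default quantiles.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "mean": statistics.mean(data),
        "q3": q3,
        "max": data[-1],
    }


# Toy lengths, not the real assembly.
summary = length_summary([1, 2, 3, 4, 5])
for key, value in summary.items():
    print(f"{key:>6}: {value}")
```

On the real assembly, the mean far above the third quartile (263709 vs. 30859) is what a handful of very long scaffolds among many short contigs would produce.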

sr320 commented 1 year ago

> What does the group think? Limit analysis to the 49 scaffolds, use ALL the data, or use scaffolds / contigs over a certain size? If the latter, what size would be logical, and we can run BUSCO on that as well.

I think it would be worth running BUSCO on what was submitted to NCBI.... which I assume is scaffolds / contigs over a certain size?

kristamnichols commented 1 year ago

@ggoetznoaa can you please run BUSCO on the 'filtered' set submitted to NCBI? I don't expect it to be different from the original BUSCO run on the full assembly (inclusive of scaffolds + contigs). Those BUSCO stats were:

BUSCO database used | arthropoda_odb10
BUSCO complete BUSCOs | 973
% BUSCO complete | 96.1
BUSCO complete and single copy | 959
% BUSCO complete and single copy | 94.7
BUSCO complete and duplicated | 14
% BUSCO complete and duplicated | 1.4
fragmented BUSCO | 14
% fragmented BUSCO | 1.4
missing BUSCOs | 26
% missing BUSCO | 2.5
Total BUSCOs searched | 1013

ggoetznoaa commented 1 year ago

Here are the results for BUSCO using the file that was submitted to NCBI

C:94.4%[S:92.9%,D:1.5%],F:3.4%,M:2.2%,n:1013       
956 Complete BUSCOs (C)            
941 Complete and single-copy BUSCOs (S)    
15  Complete and duplicated BUSCOs (D)     
34  Fragmented BUSCOs (F)              
23  Missing BUSCOs (M)             
1013    Total BUSCO groups searched        
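To compare runs side by side, the one-line BUSCO summaries posted in this thread can be parsed and differenced. This is a rough sketch: the two summary strings are copied from the comments above, while the regular expression and the `parse_busco` helper are illustrative and not part of BUSCO itself.

```python
import re

# Matches the one-line summary format shown in the BUSCO results above.
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(line):
    """Parse a BUSCO one-line summary into a dict of percentages plus n."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {line!r}")
    parsed = {key: float(value) for key, value in match.groupdict().items()}
    parsed["n"] = int(match.group("n"))
    return parsed


# Summary lines copied from the two runs reported in this thread.
runs = {
    "scaffolds only": "C:88.2%[S:86.9%,D:1.3%],F:3.4%,M:8.4%,n:1013",
    "NCBI submission": "C:94.4%[S:92.9%,D:1.5%],F:3.4%,M:2.2%,n:1013",
}
parsed = {label: parse_busco(line) for label, line in runs.items()}

# Difference in percentage points, rounded to the precision BUSCO reports.
delta_complete = round(parsed["NCBI submission"]["C"] - parsed["scaffolds only"]["C"], 1)
delta_missing = round(parsed["scaffolds only"]["M"] - parsed["NCBI submission"]["M"], 1)
print("Complete, NCBI minus scaffolds-only:", delta_complete)
print("Missing, scaffolds-only minus NCBI:", delta_missing)
```

The fragmented percentage is the same in both runs, so the drop in completeness for the scaffold-only set shows up as an increase in missing BUSCOs.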

sr320 commented 1 year ago

It would also be great to see mapping rate differences - I can do this if @laurahspencer can get me the MBD data.

laurahspencer commented 1 year ago

@sr320 the trimmed MBDBS data is on Hyak here - /gscratch/srlab/lhs3/data/DuMOAR/mbdbs-trimmed/

The 4416/ and 4417/ directories contain the 100bp and 150bp reads, respectively.

laurahspencer commented 1 year ago

Do folks have final thoughts on which version of the genome to use? @sr320 did you end up looking at mapping rates using the MBD data?

sr320 commented 1 year ago

My thought is scaffolds only. No, I did not do a full comparison.

kristamnichols commented 1 year ago

I think this is a good plan. How to proceed? Will you rerun the analyses, Laura, or should we see if Giles can help?

On Apr 10, 2023, at 2:50 PM, Steven Roberts wrote: My thought is scaffold only. no did not do a full comparison..


laurahspencer commented 1 year ago

Copy, so I will use the scaffold-only genome for aligning, i.e. only scaffolds #0-48. @ggoetznoaa do you already have that version of the genome somewhere on Sedna?

I will re-run my existing pipeline, which gets me to the identification of differentially methylated loci. I do have a preliminary gene track, so can take a stab at functional analysis, time depending.

kristamnichols commented 1 year ago

@ggoetznoaa is there a scaffold-only version of the Dungeness genome on Sedna, and if not, can you please create one and share it with @laurahspencer? Thanks! The plan you outline sounds good to me. Let us know if you want help with anything, Laura.

ggoetznoaa commented 1 year ago

There is one, it can be found on Sedna in

/share/nwfsc/ggoetz/202301-dungeness_crab-transcriptome/ref/fasta

The file is called

PGA_assembly.scaffolds_only.fasta

@kristamnichols @laurahspencer