Using fasta (DNA sequences) and gff file from busco genes

KBeine commented 7 months ago

Hi all, is it possible to create a riparian plot from DNA sequences? I want to create a synteny plot for genomes from worm species. However, the worm species dont have their own gff files, so I put the genomes through busco and have ~255 genes for the 4 worm species with their fasta files and gff files output from busco. I concatenated all the files for each species, so only left with 2 files (fasta and gff) for each of them. But the data is DNA and not proteins. Can I still use genespace? I got genespace to work with the example data and when I try to run my data it tells me Error: subscript contains invalid names when trying the ParsedPaths command, my code for this is: ParsedPaths <- parse_annotation( rawGenomeRepo = genomeRepo, genomeDirs = genomes2run, genomeIDs = genomes2run, gffString = "gff", faString = "fasta", presets = 'ncbi' genespaceWd = wd)

If its relevant my data looks like this:

$ head Alitta_busco.fasta ">1001705at2759_29159_0:0046be|OW028576.1:45821503-45827982 caggATTTTGATGGAGCCCTGGACTGCGTGAGCATTGCAGTAACCTGTGCATTCAACCGA CATGGCACACTTCTCGCAGTTGGGTGTAATGATGGCCGAATTGTCGTTTGGGATTTTCTC ACTAGGGGAATAGCAAAAGTCATCAGTGCCCATGTTCATCCTGTGTGTAGCATCAGCTGG AGTCGCAATGGTAAAAAACTGCTGAGTGCTGCGACGGACAACACTGTCTCTATTTGGGAC ATCATGACGGGAGAATGTGACAAAACGTTCCGATTCCCTTCTCCCGTTCTGAAGGTCCAG TTCAATCCTCGAGACAGCAAAATGTTCATCATTTGTCCGATGAAACATGCGGCGGTCGTG ATGAGCTTAGACGGATCATATACGACTATACCTCTGGATGATGATTCTGACCTGAATGTT GTGGCGGCGTTCGACCGACAAGGAGACCACATATACACAGGCAATGCTAAGGGCAAGGTT TGTGCTTatAAAACGAAGAGCCTAGATGTTGTAGCGTCTTTCAAAGTCACAACTGGAACA"

head -5 Alitta_busco.gff "OW028573.1 MetaEuk gene 46740364 46740573 44 - . Target_ID=1001705at2759_1155016_0:002f22;TCS_ID=1001705at2759_1155016_0:002f22|OW028573.1|-|46740363 OW028573.1 MetaEuk mRNA 46740364 46740573 44 - . Target_ID=1001705at2759_1155016_0:002f22;TCS_ID=1001705at2759_1155016_0:002f22|OW028573.1|-|46740363_mRNA;Parent=1001705at2759_1155016_0:002f22|OW028573.1|-|46740363 OW028573.1 MetaEuk exon 46740364 46740573 44 - . Target_ID=1001705at2759_1155016_0:002f22;TCS_ID=1001705at2759_1155016_0:002f22|OW028573.1|-|46740363_exon_0;Parent=1001705at2759_1155016_0:002f22|OW028573.1|-|46740363_mRNA OW028573.1 MetaEuk CDS 46740364 46740573 44 - . Target_ID=1001705at2759_1155016_0:002f22;TCS_ID=1001705at2759_1155016_0:002f22|OW028573.1|-|46740363_CDS_0;Parent=1001705at2759_1155016_0:002f22|OW028573.1|-|46740363_exon_0 OW028576.1 MetaEuk gene 45821504 45827983 692 - . Target_ID=1001705at2759_29159_0:0046be;TCS_ID=1001705at2759_29159_0:0046be|OW028576.1|-|45821503"

jtlovell commented 7 months ago

yes - it is possible to make a riparian for WGAs, just not with GENESPACE. We are currently trying to determine the best way to release this functionality. It may be added to GENESPACE, or may be a separate package. No matter the release method, it won't be public until fall 2024 at the earliest. If this is a functionality that your research absolutely needs sooner than this timeline, email me and we can figure out a path forward.

KBeine commented 7 months ago

I have a separate issue now, but I used fa (peptide) files instead of fasta (DNA) files, the parse function didn't work with my gff and fa files to create the cleaned peptide files and the bed files, so I created them myself at the suggestion of another issue. I had to do a bit of matching between the bed and peptide files to get the gpar <- init_genespace(wd = wd, path2mcscanx = path2mcscanx) function to work, I am now stuck on the out <- run_genespace(gpar, overwrite = T) function:

gpar <- init_genespace(

wd = wd,

path2mcscanx = path2mcscanx) Checking Working Directory ... PASS: /home/user/Documents/UBUNTU/Alignment Checking user-defined parameters ... Genome IDs & ploidy ... alitta : 1 Hdiv : 1 platynereis: 1 Outgroup ... NONE n. parallel processes ... 8 collinear block size ... 5 collinear block search radius ... 25 n gaps in collinear block ... 5 synteny buffer size... 100 only orthogroups hits as anchors ... TRUE n secondary hits ... 0 Checking annotation files (.bed and peptide .fa): alitta : 246 / 246 geneIDs exactly match (PASS) Hdiv : 221 / 221 geneIDs exactly match (PASS) platynereis: 239 / 239 geneIDs exactly match (PASS) Checking dependencies ... Found valid path to OrthoFinder v2.55: orthofinder Found valid path to DIAMOND2 v2.19: diamond Found valid MCScanX_h executable: /home/user/Apps/MCScanX/MCScanX_h
GENESPACE prints a ton of output. We don’t provide a mechanism to suppress this because it is crucial to make sure things are looking ok. There are some descriptions about the text printed to the console (see below output …)

out <- run_genespace(gpar, overwrite = T)

############################

Running orthofinder (or parsing existing results) Checking for existing orthofinder results ... [1] TRUE Copying files over to the temporary directory: /home/user/Documents/UBUNTU/Alignment/tmp Running the following command in the shell: orthofinder -f /home/user/Documents/UBUNTU/Alignment/tmp -t 8 -a 1 -X -o /home/user/Documents/UBUNTU/Alignment/orthofinder.This can take a while. To check the progress, look in the WorkingDirectory in the output (-o) directory

OrthoFinder version 2.5.5 Copyright (C) 2014 David Emms

2024-04-25 18:50:02 : Starting OrthoFinder 2.5.5 8 thread(s) for highly parallel tasks (BLAST searches etc.) 1 thread(s) for OrthoFinder algorithm

Checking required programs are installed

Test can run "mcl -h" - ok Test can run "fastme -i /home/user/Documents/UBUNTU/Alignment/orthofinder/Results_Apr25/WorkingDirectory/dependencies/SimpleTest.phy -o /home/user/Documents/UBUNTU/Alignment/orthofinder/Results_Apr25/WorkingDirectory/dependencies/SimpleTest.tre" - ok

Dividing up work for BLAST for parallel processing

2024-04-25 18:50:02 : Creating diamond database 1 of 3 2024-04-25 18:50:02 : Creating diamond database 2 of 3 2024-04-25 18:50:02 : Creating diamond database 3 of 3

Running diamond all-versus-all

Using 8 thread(s) 2024-04-25 18:50:02 : This may take some time.... 2024-04-25 18:50:02 : Done 0 of 9 2024-04-25 18:50:06 : Done all-versus-all sequence search

Running OrthoFinder algorithm

2024-04-25 18:50:06 : Initial processing of each species 2024-04-25 18:50:06 : Initial processing of species 0 complete 2024-04-25 18:50:06 : Initial processing of species 1 complete 2024-04-25 18:50:06 : Initial processing of species 2 complete 2024-04-25 18:50:08 : Connected putative homologues 2024-04-25 18:50:08 : Written final scores for species 0 to graph file 2024-04-25 18:50:08 : Written final scores for species 1 to graph file 2024-04-25 18:50:08 : Written final scores for species 2 to graph file 2024-04-25 18:50:09 : Ran MCL

Writing orthogroups to file

OrthoFinder assigned 697 genes (98.7% of total) to 224 orthogroups. Fifty percent of all genes were in orthogroups with 3 or more genes (G50 was 3) and were contained in the largest 99 orthogroups (O50 was 99). There were 192 orthogroups with all species present and 184 of these consisted entirely of single-copy genes.

2024-04-25 18:50:09 : Done orthogroups

Analysing Orthogroups

Calculating gene distances

2024-04-25 18:50:11 : Done 2024-04-25 18:50:12 : Done 0 of 9

Inferring gene and species trees

Best outgroup(s) for species tree

2024-04-25 18:50:13 : Starting STRIDE 2024-04-25 18:50:13 : Done STRIDE Observed 0 well-supported, non-terminal duplications. 0 support the best roots and 0 contradict them. Best outgroups for species tree: Hdiv alitta platynereis

WARNING: Multiple potential species tree roots were identified, only one will be analyed.

Reconciling gene trees and species tree

Outgroup: Hdiv 2024-04-25 18:50:13 : Starting Recon and orthologues 2024-04-25 18:50:13 : Starting OF Orthologues 2024-04-25 18:50:13 : Done 0 of 9 2024-04-25 18:50:13 : Done OF Orthologues

Writing results files

2024-04-25 18:50:13 : Done orthologues

Results: /home/user/Documents/UBUNTU/Alignment/orthofinder/Results_Apr25/

CITATION: When publishing work that uses OrthoFinder please cite: Emms D.M. & Kelly S. (2019), Genome Biology 20:238

If you use the species tree in your work then please also cite: Emms D.M. & Kelly S. (2017), MBE 34(12): 3267-3278 Emms D.M. & Kelly S. (2018), bioRxiv https://doi.org/10.1101/267914 re-ordering genomeIDs by the species tree: alitta, platynereis, Hdiv

############################

Combining and annotating bed files w/ OGs and tandem array info ... ############## Flagging chrs. w/ < 10 unique orthogroups ...Hdiv : 38 genes on 6 small chrs. ...alitta : 27 genes on 4 small chrs. ...platynereis: 113 genes on 42 small chrs. NOTE! Genomes flagged have > 5% of genes on small chrs. These are likely not great assemblies and should be examined carefully ############## Flagging over-dispered OGs ...Hdiv : 0 genes in 0 OGs hit > 8 unique places ...alitta : 0 genes in 0 OGs hit > 8 unique places ...platynereis: 0 genes in 0 OGs hit > 8 unique places ############## Annotation summaries (after exclusions): ...Hdiv : 183 genes in 181 OGs || 0 genes in 0 arrays ...alitta : 219 genes in 217 OGs || 0 genes in 0 arrays ...platynereis: 126 genes in 126 OGs || 0 genes in 0 arrays

############################

Combining and annotating the blast files with orthogroup info ...
Chunk 1 / 1 (06:50:14 PM) ...

...platynereis v. alitta: total hits = 310, same og = 237 ...Hdiv v. platynereis: total hits = 283, same og = 213 ...alitta v. Hdiv: total hits = 284, same og = 217 ...platynereis v. platynereis: total hits = 319, same og = 243 ...alitta v. alitta: total hits = 321, same og = 248 ...Hdiv v. Hdiv: total hits = 275, same og = 223 ############## Generating dotplots for all hits ... Done!

############################

Flagging synteny for each pair of genomes ...
Chunk 1 / 1 (06:50:16 PM) ...

...platynereis v. alitta: 123 hits (60 anchors) in 5 blocks (0 SVs, 5 regions) ...alitta v. Hdiv: 150 hits (75 anchors) in 17 blocks (12 SVs, 5 regions) ...Hdiv v. platynereis: 115 hits (52 anchors) in 6 blocks (1 SVs, 5 regions) ...alitta v. alitta: 255 hits (246 anchors) in 14 blocks (0 SVs, 0 regions) ...platynereis v. platynereis: 240 hits (239 anchors) in 46 blocks (0 SVs, 0 regions) ...Hdiv v. Hdiv: 227 hits (221 anchors) in 14 blocks (0 SVs, 0 regions)

############################

Building synteny-constrained orthogroups ... Done!

############################

Integrating syntenic positions across genomes ... ############## Generating syntenic dotplots ... Done! ############## Interpolating syntenic positions of genes ... alitta: (0 / 1 / 2 / >2 syntenic positions) Hdiv : 68 / 115 / 0 / 0 alitta : 0 / 246 / 0 / 0 platynereis: 19 / 107 / 0 / 0 platynereis: (0 / 1 / 2 / >2 syntenic positions) Hdiv : 80 / 103 / 0 / 0 alitta : 101 / 118 / 0 / 0 platynereis: 0 / 239 / 0 / 0 Hdiv: (0 / 1 / 2 / >2 syntenic positions) Hdiv : 0 / 221 / 0 / 0 alitta : 106 / 113 / 0 / 0 platynereis: 36 / 90 / 0 / 0 Done!

############################

Final block coordinate calculation and riparian plotting ... ############## Calculating syntenic blocks by reference chromosomes ... n regions (aggregated by 25 gene radius): 68 n blocks (collinear sets of > 5 genes): 72 ############## Building ref.-phased blks and riparian plots for haploid genomes: Error in [.data.table(bedm, , :=(reford, (1:.N)/.N), by = "genome") : Supplied 2 items to be assigned to group 1 of size 0 in column 'reford'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code. In addition: Warning messages: 1: In mclapply(1:nrow(chnk), mc.cores = nCores, function(i) { : all scheduled cores encountered errors in user code 2: In mclapply(1:nrow(chnk), mc.cores = nCores, function(i) { : all scheduled cores encountered errors in user code

jtlovell commented 7 months ago

This doesn't look like a eukaryote ... not nearly enough genes.

alitta : 246 / 246 geneIDs exactly match (PASS)
Hdiv : 221 / 221 geneIDs exactly match (PASS)
platynereis: 239 / 239 geneIDs exactly match (PASS)

There are a number of warnings about these genomes too ... my guess is there is something up with your genomes.

KBeine commented 7 months ago

I used the genomes from alitta and platynereis from NCBI, neither of them had gff files, Hdiv is our own genome, we have annotated it. I used the busco function: busco -i genome -o busco_output -l eukaryota_odb10 -m genome -e 1e-06 --limit 1, created the bed file from the full_table.tsv and concatenated the .faa files from the single_copy_busco_sequences, it was matching to 255 genes, which all three species had above 90% match for the busco genes. Will this be too few genes to map with genescape?

jtlovell commented 7 months ago

Yeah, this is too few. You definitely don't want to just use BUSCO genes. Just do a full annotation (or liftover) and use that. 100s of genes per chromosome is a good starting point. With less than that, it can be challenging to find synteny except in very closely related genomes.

jtlovell / GENESPACE

Using fasta (DNA sequences) and gff file from busco genes #150

GENESPACE prints a ton of output. We don’t provide a mechanism to suppress this because it is crucial to make sure things are looking ok. There are some descriptions about the text printed to the console (see below output …)

Checking required programs are installed

Dividing up work for BLAST for parallel processing

Running diamond all-versus-all

Running OrthoFinder algorithm

Writing orthogroups to file

Analysing Orthogroups

Calculating gene distances

Inferring gene and species trees

Best outgroup(s) for species tree

Reconciling gene trees and species tree

Writing results files

Chunk 1 / 1 (06:50:14 PM) ...

Chunk 1 / 1 (06:50:16 PM) ...