thackl / gggenomes

A grammar of graphics for comparative genomics
https://thackl.github.io/gggenomes/
Other
579 stars 64 forks source link

Too many col_names #98

Closed olekto closed 2 years ago

olekto commented 2 years ago

Hi, I'm really excited about this package. Finally a customisable tool that can draw nice synteny figures!

However, I jumped right into my use-case, and started with trying to draw three homolog chromosomes from some fishes. When reading in the sequences (group6_seq = read_fai("group6.fasta")), I get this warning:

Warning message:                                                                                                                                     
Too many col_names, ignoring extra ones 

When trying to plot this, I get this:

p1 <- gggenomes(group6_seq) +
+   geom_seq() + geom_bin_label()
No seqs provided, inferring seqs from feats

And then I get this error:

Error: Problem with `summarise()` column `length`.
ℹ `length = max(start, end)`.
x invalid 'type' (closure) of argument
ℹ The error occurred in group 1: bin_id = ">Chr9", seq_id = ">Chr9".
Run `rlang::last_error()` to see where the error occurred.

My R-fu isn't that strong, so maybe I messed up somewhere. However, I was wondering what limitations there are regarding sizes of the sequences that I want to read in. These are about 35 Mbp each, totalling 100 Mbp. I don't really want to look at genes and such in the first round, but more chromosome structure (inversions, centromeres, etc). Is there any reason why this shouldn't work?

Thank you.

Sincerely, Ole

thackl commented 2 years ago

Hi Ole,

happy to hear that you like the idea of the package! And looking more at chromosome-scale rather than genes is definitely an interesting use case. I'd be curious about your experience going forward because this is something that people don't do that often and so I get less feedback on this kind of use than just on plotting genes.

Try use read_seqs("groups6.fasta") to read fasta files.

read_fai() is a low-level parser for fasta index files. Those are tab-separated files generated by samtools faidx/seqkit faidx.

The size of the sequences/fasta file does not matter. gggenomes essentially ignores the sequences themselves, just reads the headers and counts the length of the sequences.

olekto commented 2 years ago

Hi Thomas, that worked nicely.

I had generated fai files also, so it looked like it parsed those. Not sure what happened there.

Ole

thackl commented 2 years ago

read_fai() just reads text files line by line and expects a seq_id in the first col and a seq length in the second. That also works on a plain fasta file, though makes no sense. The error message you got was misleading. If you read a fasta file directly with read_fai, it just reads one column instead of the expected two.

You can use read_seqs also on different files including .fai index and .fasta files. It will internally call the right parser, i.e. read_fai for .fai, read_seq_len() for fasta, gff or genbank. If you already have .fai files, reading those is a lot faster than reading .fasta files.