s312569 / clj-biosequence

A Clojure library designed to make the manipulation of biological sequence data easier.

Genbank record as map? #35

Open averagehat opened 8 years ago

averagehat commented 8 years ago

The look-up functions for genbank records are useful, but I am finding it difficult to work with the genbank records in this fashion. Is there any way within the API (or recommended method) to get the record as a plain clojure map?

s312569 commented 8 years ago

Hi Mike

In a way they are already just maps (records behave as maps for most intents and purposes). There is a lot of data in that particular genbank format, so I really just shove the XML into a genbank record and access what I want using zippers etc. The XML is represented as a record as well but, as I said, there is a lot of data in that format, so it makes for a complicated data structure. The accessors are just the ones I have found useful. So I guess there are two things that could be done:

  1. Just write a function that takes the genbank record and extracts what you want into a map and include that in your workflow. What data are you interested in? Is it generally the same? If you just want a fasta representation you could just use 'init-fasta-sequence' and use the accessors to fill out the arguments.
  2. If you aren't using most of the data in the genbank record it might be more efficient to use a different genbank format - there are a few, some of which are small and might contain the information you are after. If you let me know what data you want to extract from each sequence I can have a look around and include a parser for that format.
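
To illustrate the point that records already behave as maps, and what option 1 might look like, here is a minimal sketch. `DemoSequence`, `record->map` and the `accessors` map are hypothetical names made up for this example; they are not part of clj-biosequence, and in practice the values in `accessors` would be whichever library accessors you already use.

```clojure
;; Hypothetical stand-in for a library record; NOT the real clj-biosequence type.
(defrecord DemoSequence [accession description residues])

(def rec (->DemoSequence "ABC123" "demo entry" "ATGC"))

;; Records support keyword lookup and the usual map functions.
(:accession rec)   ;; => "ABC123"
(keys rec)         ;; => (:accession :description :residues)

;; `into {}` copies the fields into an ordinary persistent map.
(def as-map (into {} rec))
(record? as-map)   ;; => false

;; Option 1: apply a map of output-key -> accessor function to the record
;; and keep only the results.
(defn record->map
  [accessors record]
  (into {} (map (fn [[k f]] [k (f record)])) accessors))

;; Assumed accessors for the demo record; swap in real ones as needed.
(def accessors
  {:accession   :accession
   :description :description
   :length      (comp count :residues)})

(record->map accessors rec)
;; => {:accession "ABC123", :description "demo entry", :length 4}
```

Note that `into {}` only converts the top level. If a record mostly wraps nested XML, the resulting map is just as deeply nested, which is why a small extraction function is probably the more useful of the two.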

Cheers Jason

averagehat commented 8 years ago

Thanks Jason. The project requirements changed, so I don't need what I thought I needed.

I guess what I was thinking of at the time was flattening the XML map so it would swap the :content and :tag keys and be more like a normal map, but upon consideration I'm not sure that is useful.
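
For reference, the flattening idea could be sketched roughly as below, assuming the XML is held in the usual `{:tag … :attrs … :content …}` element shape that `clojure.xml` produces. `flatten-element` is a hypothetical helper, the tag names in the sample are only illustrative, and the sketch deliberately drops attributes and any text mixed in between child elements.

```clojure
;; Hypothetical helper: turns {:tag :a :content [...]} into {:a ...}.
;; Text-only content becomes a string; child elements are merged into a map,
;; and repeated child tags are collected into a vector.
(defn flatten-element
  [{:keys [tag content]}]
  (let [children (filter map? content)]
    {tag (if (empty? children)
           (apply str content)
           (apply merge-with
                  (fn [a b] (if (vector? a) (conj a b) [a b]))
                  (map flatten-element children)))}))

;; Illustrative element; not taken from a real record.
(def element
  {:tag :GBSeq
   :attrs nil
   :content [{:tag :GBSeq_locus  :attrs nil :content ["ABC123"]}
             {:tag :GBSeq_length :attrs nil :content ["4"]}]})

(flatten-element element)
;; => {:GBSeq {:GBSeq_locus "ABC123", :GBSeq_length "4"}}
```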

I had trouble grokking the readme: not knowing much about protocols, I didn't find the protocol tree obvious. I think it would help to make the namespace explicit in the examples; I found myself looking in the wrong namespace because the function names are the same.

Because it's useful to have an external perspective on documentation, I can make a PR for this and maybe some other documentation ideas. Once that is done I can think about whether or not another view of the genbank data would help.

s312569 commented 8 years ago

Yeah, the original idea was to have a common interface for all sequence formats, but what I've ended up with is somewhat more confusing than I had intended. Inevitable really, given the many differences between the formats. So documentation ideas are very welcome.

I'll go over the documentation and try and make it more user friendly - I've also toyed with just splitting each format parser into a separate library ...