mjy / obo_parser

An OBO file parser.
MIT License

Feature Request: OBO Enumerator #3

Open Thyra opened 4 years ago

Thyra commented 4 years ago

I just tried parsing the Gene Ontology, which is pretty huge, and my MacBook almost had a heart attack. What do you think about adding a method that enables lazy parsing (i.e. only parsing one Term at a time, whenever somebody asks for it), either as a separate method or perhaps with an option such as lazy=true? I've been using this super-simple Python OBO parser until now, but now that I'm trying to package my software into a Ruby gem, of course everything should be just Ruby. And I don't like the idea of having multiple obo_parser gem equivalents floating around, or of everybody starting their own thing from scratch. I understand this would make many of the sanity/crossref checks impossible, but I only care about very specific parts of the terms anyway and would rather have it parse quickly than safely in this case.

Thyra commented 4 years ago

I just noticed: I think what I'm asking for is actually not a lazy way of parsing but a transient one, where only one stanza is kept in memory at a time. Something like iterate_over_obo(IO) that would return an Enumerator, usable like this:

iterate_over_obo(File.open("go.obo")).each do |term|
  puts term.id.value
  # ...
end

names = iterate_over_obo(File.open("go.obo")).map do |term|
  term.name.value
end

Do you think that would be a valuable addition to the gem and would you be able to implement it?
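
To make the request concrete, here is a rough sketch of how such an enumerator could work. It is not obo_parser's actual API: `iterate_over_obo`, the `Stanza` struct and `value_of` are hypothetical names, and the stanza objects are much simpler than the gem's (so `term.id.value` from the usage above becomes `stanza.value_of("id")` here). It assumes a plain OBO flat file in which `[Term]`-style headers start a stanza and each following line is `tag: value`. It performs no cross-reference or sanity checks, and values are kept raw, including any trailing `! comment`.

```ruby
require "stringio"

# Hypothetical stand-in for a parsed stanza; the gem's real classes differ.
Stanza = Struct.new(:type, :tags) do
  # First raw value recorded for a tag, or nil if the tag is absent.
  def value_of(tag)
    tags.fetch(tag, []).first
  end
end

# Hypothetical helper: yields one Stanza at a time from any IO-like object.
# Without a block it returns an Enumerator, so each/map/lazy all work.
def iterate_over_obo(io)
  return enum_for(:iterate_over_obo, io) unless block_given?

  current = nil
  io.each_line do |raw|
    line = raw.strip
    next if line.empty? || line.start_with?("!") # blank or comment line

    if (m = line.match(/\A\[(\w+)\]\z/))
      # A new stanza header: hand over the finished stanza first.
      yield current if current
      current = Stanza.new(m[1], {})
    elsif current && (m = line.match(/\A([^:\s]+):\s*(.*)\z/))
      # tag: value pair inside a stanza (header tags are skipped).
      (current.tags[m[1]] ||= []) << m[2]
    end
  end
  yield current if current # last stanza in the file
end

sample = StringIO.new(<<~OBO)
  format-version: 1.2

  [Term]
  id: GO:0000001
  name: mitochondrion inheritance

  [Term]
  id: GO:0000002
  name: mitochondrial genome maintenance
OBO

names = iterate_over_obo(sample).map { |stanza| stanza.value_of("name") }
# names == ["mitochondrion inheritance", "mitochondrial genome maintenance"]
```

Because the method only ever holds the stanza it is currently building, memory use should stay roughly constant regardless of file size. Chaining through `.lazy` (for example `iterate_over_obo(io).lazy.select { ... }.first(10)`) would also stop reading the file as soon as enough stanzas have been found.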