mamarjan / gff3-pltools

A fast parallel GFF3 parser
MIT License

replace_url_escaped_chars() too slow #30

Closed mamarjan closed 12 years ago

mamarjan commented 12 years ago

Tests with the 1GB file from Wormbase show that this function is probably the bottleneck when parsing a file with escaped characters. The file in question has multiple escaped newline characters in the attributes field of pretty much every record.

Instead of the expected ~30 seconds, the benchmark application ran for 4-5 minutes until I interrupted it with Ctrl-C.

The function should be replaced with a similar one with much better performance.

mamarjan commented 12 years ago

This is a more complicated issue than anticipated. For now, the replacement of escaped characters has been made optional, but it is activated by default. In the future, a char[] version should be developed, and a lazy version too.

Because of that, flagging this issue for later.

pjotrp commented 12 years ago

I was thinking: can't you scan for escaped chars first? That should be fast. Then translate only a record that has escaped characters - otherwise it can stay raw.

The second option is to make it lazy, i.e. don't translate until a user requests the data. With filtering, the only field that needs to be checked is the record id. That would mean most fields never need to be translated. Right?
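
A minimal sketch of that scan-first guard in D (the helper name and the decoder parameter are made up here, since the exact signature of replace_url_escaped_chars() isn't shown in this thread):

```d
import std.string : indexOf;

// Only records that actually contain a '%' pay for decoding; everything else
// is handed back as the original slice. `unescape` stands in for whatever
// decoder the library uses.
string maybeUnescape(string raw, string delegate(string) unescape)
{
    if (indexOf(raw, '%') == -1)
        return raw;        // fast path: nothing escaped in this field
    return unescape(raw);  // slow path: decode only when needed
}
```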

mamarjan commented 12 years ago

Regarding the first option, you just described how it has been working from the start.

I did think about the second option, but that is still something that would be done in the main thread, and I would like to offload that thread as much as possible. As for filtering, the filtered value could be translated to its escaped form and then used for filtering, so no problem there.

Flagged for later for now. The third option I'm pondering (making it based on char[] instead of string and then replacing escaped chars in place) could interfere with threading, so I would like to get threading going first to be able to test the third option.
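
For reference, a rough sketch of what the char[]-based third option could look like (the names here are made up, not the library's code). Percent-decoding never makes the text longer, so the escapes can be replaced in place and the result returned as a shorter slice of the same buffer, with no allocation:

```d
import std.ascii : isHexDigit;

// Decode %XX escapes in place and return the used part of the buffer.
// Invalid or truncated escapes are copied through unchanged.
char[] decodeInPlace(char[] buf)
{
    static int hexValue(char c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return c - 'A' + 10;
    }

    size_t w = 0;
    for (size_t r = 0; r < buf.length; r++) {
        if (buf[r] == '%' && r + 2 < buf.length
                && isHexDigit(buf[r + 1]) && isHexDigit(buf[r + 2])) {
            buf[w++] = cast(char) (hexValue(buf[r + 1]) * 16 + hexValue(buf[r + 2]));
            r += 2;  // skip the two hex digits just consumed
        } else {
            buf[w++] = buf[r];
        }
    }
    return buf[0 .. w];
}
```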

pjotrp commented 12 years ago

How can the first option be slow? Assuming escaped chars are relatively rare, a really fast scan should not have such a large effect. A record is a char buffer, right? You only need to find a valid escape symbol. That should be a fast search.

mamarjan commented 12 years ago

No, that is not the problem. The first option is currently implemented. The problem is that the 1GB file from Wormbase has 6.5 million escaped characters.

pjotrp commented 12 years ago

What are these escaped characters? What are they for?

mamarjan commented 12 years ago

%0A, that is, newlines. They are part of an "info" attribute; here is an excerpt:

info=position:136-200 method:InterPro accession:IPR001356 description:Homeobox %0Aposition:138-193 method:InterPro accession:IPR008422 description:Homeobox KN domain %0Aposition:105-212

lomereiter commented 12 years ago

Probably you can borrow some ideas from the std.uri module; there's a URI decode function, and AFAIU you need only a subset of its functionality: https://github.com/D-Programming-Language/phobos/blob/master/std/uri.d#L222
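
For what it's worth, the public entry points around the linked Phobos code are std.uri.decode and std.uri.decodeComponent, and the latter already handles the case from the excerpt above (this is just a quick illustration, not a claim about how gff3-pltools should call it):

```d
import std.stdio : writeln;
import std.uri : decodeComponent;

void main()
{
    auto raw = "description:Homeobox %0Aposition:138-193";
    // decodeComponent turns every %XX escape back into its character,
    // so the %0A above becomes a real newline and this prints two lines.
    writeln(decodeComponent(raw));
}
```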

pjotrp commented 12 years ago

I think if it is lazy, you won't parse attributes at all. They are not used by the system. The speedup comes from ignoring attributes until someone wants them. If someone wants attributes with escaped characters, tough luck to him (speed-wise).

mamarjan commented 12 years ago

Thx Artem, that seems like a very optimized piece of code.

The solution would probably be twofold: make it lazy, and provide a fast function that the user can call to replace escaped characters.
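
A minimal sketch of that twofold plan, with made-up names and with std.uri.decodeComponent standing in for the fast replacement function: the parser keeps the raw slice of the attributes field, and unescaping is only paid for when a value is actually requested, and only if it contains a '%'.

```d
import std.string : indexOf;
import std.uri : decodeComponent;

struct AttributeValue
{
    string raw;              // slice of the original attributes field, untouched by the parser
    private string decoded;  // cached unescaped form
    private bool done;

    // Decoding happens on first access only, off the parsing hot path.
    string value()
    {
        if (!done) {
            decoded = indexOf(raw, '%') == -1 ? raw : decodeComponent(raw);
            done = true;
        }
        return decoded;
    }
}
```

With filtering, only the value actually being filtered on would go through value(), which matches the earlier point that most fields never need to be translated.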

mamarjan commented 12 years ago

It's working and fast enough for now, at least for the test files that I have. Per Pjotr's comments, if escaped characters are indeed rare, then with the current implementation they won't have any performance impact.