knmnyn / ParsCit

An open-source CRF Reference String Parsing Package
http://wing.comp.nus.edu.sg/parsCit
GNU Lesser General Public License v3.0

Can we accelerate the running speed of ParsCit? #31

Closed: betterRunner closed this issue 6 years ago

betterRunner commented 6 years ago

Dear ParsCit authors,

I have tested the script citeExtract.pl and the results are outstanding, but I am a little confused about the running speed. The script takes 3 to 5 seconds to parse one paper, during which one CPU core reaches 100% usage. Is that normal performance? If so, is there any way to speed it up? We want to use ParsCit as part of our web service, and 3 to 5 seconds at 100% CPU usage seems too high for our server.

Thank you very much!

cmkumar87 commented 6 years ago

Hi @betterRunner

Thanks for your query and for using our software!

Have you tried running citeExtract.pl with the option for extracting just the reference strings instead of the -m extract_all option? This should be fast, as it runs only the one linear CRF model for 'reference string parsing'. What's your application? Is your input raw text or XML?
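To make the comparison suggested above concrete, here is a minimal timing sketch in Python. It assumes `perl` and `citeExtract.pl` are reachable from the working directory, and that the citation-only mode is called `extract_citations`; only `-m extract_all` is confirmed in this thread, so check the script's usage text for the exact mode name. The helper names (`build_command`, `time_call`, `compare_modes`) are illustrative and not part of ParsCit.

```python
import statistics
import subprocess
import time


def build_command(mode, input_path, script="citeExtract.pl"):
    """Build the argv list for one ParsCit run.

    Assumption: the script takes `-m <mode>` followed by the input file.
    """
    return ["perl", script, "-m", mode, input_path]


def time_call(fn, repeats=3):
    """Call fn() `repeats` times and return each run's wall-clock seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


def compare_modes(input_path, modes=("extract_all", "extract_citations")):
    """Return {mode: median seconds} for each mode on the same input file."""
    results = {}
    for mode in modes:
        cmd = build_command(mode, input_path)
        # check=True raises CalledProcessError if the script exits non-zero.
        timings = time_call(
            lambda: subprocess.run(cmd, check=True, capture_output=True)
        )
        results[mode] = statistics.median(timings)
    return results
```

Calling `compare_modes("paper.txt")` on a representative paper would show whether the lighter mode actually reduces the per-paper cost on your hardware before you commit to it in the web service.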

knmnyn commented 6 years ago

Yes, that is normal performance on a typical desktop system. Unfortunately, we do not have the bandwidth to publish speed optimizations. Our neural model, although it runs slower and has a larger memory footprint, is the one currently being developed and supported.

https://github.com/WING-NUS/Neural-ParsCit

betterRunner commented 6 years ago

Thanks for the fast reply, @cmkumar87, @knmnyn

Actually, my application needs not only "reference string parsing" but also citation extraction from the whole text, so I think the speed would be much the same as with extract_all.

As @knmnyn mentioned, the neural model runs slower. I am wondering why the deep learning method is slower, since many deep learning methods in image processing run faster than the traditional ones (sorry if the comparison is unreasonable). Could its speed exceed that of ParsCit in the future?

Thank you very much!

cmkumar87 commented 6 years ago

@betterRunner Thanks! The citation marker extraction is not part of the model at all. We find citation markers and extract the context based on a length measured in bytes. I can point to this in the codebase if you like.
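For readers who want to do this step outside ParsCit, the idea of "find a marker, then take a fixed number of bytes around it" can be sketched as follows. This is an illustration only and not the ParsCit implementation: the marker pattern (numeric brackets such as `[12]` or `[3, 4]`), the `radius` default, and the function name `citation_contexts` are all assumptions.

```python
import re

# Hypothetical marker pattern: numeric bracket citations like [12] or [3, 4].
# Real papers also use author-year and other styles, which this ignores.
MARKER_RE = re.compile(rb"\[\d+(?:\s*,\s*\d+)*\]")


def citation_contexts(text, radius=200):
    """Return (marker, context) pairs, with context measured in bytes.

    The window extends `radius` bytes on each side of the marker in the
    UTF-8 encoding of `text`.
    """
    data = text.encode("utf-8")
    contexts = []
    for match in MARKER_RE.finditer(data):
        start = max(0, match.start() - radius)
        end = min(len(data), match.end() + radius)
        # A byte window can cut a multi-byte character in half, so any
        # incomplete bytes at the edges are dropped when decoding.
        window = data[start:end].decode("utf-8", errors="ignore")
        contexts.append((match.group().decode("ascii"), window))
    return contexts
```

Because this is a single regular-expression pass over the text with no model inference, it is cheap compared with the CRF-based parsing step.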

betterRunner commented 6 years ago

@cmkumar87, sorry for the late reply. I took your advice and ran only the 'citation extraction' instead of 'extract all'. It is faster, though it still needs 3 to 5 seconds to finish. As for the citation marker extraction, I think I can achieve it in other ways. Thank you very much!