Open jkeck opened 8 years ago
I think the profiling may be misleading -- it looks like the expensive part is calling #inspect
on the record and spitting out the result. Can you try to refine the profiling to something that better represents real-world use?
require 'ruby-prof'
require 'mods'
require 'faraday'
purl = 'https://purl.stanford.edu/pj169kw1971.mods'
mods_file = Faraday.get(purl).body
result = RubyProf.profile do
record = Mods::Record.new.from_str(mods_file)
record.inspect; 1
end
"Mods Version: #{Mods::VERSION}"
RubyProf::FlatPrinter.new(result).print(STDOUT)
Perhaps. To be totally honest, I'm not a profiling expert. Maybe I should have just left that data out (because it is possibly misleading) but I figure somebody who is more facile with profiling could take this and run with it. I guess more than anything I just wanted to point out that this issue existed (I'll leave the data out next time).
I don't quite understand how calling Mods::Record.new
isn't real-world use. It's what is described in the usage section on the README of the gem, but maybe I'm missing something.
That being said, simply calling Mods::Record.new
with that MODS takes a very long time (> 1 minute). I thought that the fact that it took so long to instantiate the class was indicative of a problem (and adding other method calls to fetch data out of the MODS could cloud up the results of the original problem). I'll add a more basic example with simple timing.
require 'mods'
require 'faraday'
purl = 'https://purl.stanford.edu/pj169kw1971.mods'
mods_file = Faraday.get(purl).body
start_time = Time.now
record = Mods::Record.new.from_str(mods_file)
puts record.sort_title.inspect
end_time = Time.now
puts "We took #{end_time - start_time} seconds to get the sort_title"
We took 84.352694 seconds to get the sort_title
So in this case I want to get the sort title of this record. In order to do so it took more than 84 seconds.
I'll leave what to profile up to somebody who may be more knowledgable in that area (because clearly I am not).
Sorry -- what I mean is, Mods::Record.new
takes almost no time at all (here, I'm seeing it take about 0.3s to run your code if we don't spit it out to the screen). It's where you're outputting the result for IRB that you're taking a very long time, and I'm guess /that's/ not how it's used in the real world. Instead, we'd be extracting many little bits of data.
So when you run my code example in https://github.com/sul-dlss/mods/issues/20#issuecomment-155168261 you're getting the sort_title output in 0.3s? I think that's a typical use case, store the record to a variable, call a method on that object to extract a piece of data (the sort_title
). That data isn't particularly large so it should not be expensive to spit it out to IRB.
(this may be specific to records w/ many subjects, which is the case in the example I'm using)
In looking into performance issues reported in METADOR-56 I have started doing a little profiling in the Mods/StanfordMods/ModsDisplay stack.
I think my profiling has indicated a pretty serious performance issue in the Mods gem when instantiating new records from large-ish record.
Example XML: https://purl.stanford.edu/pj169kw1971.mods (43.7 KB) (This record has about 280 subject elements in it)
Simply instantiating a new
Mods::Record
viafrom_str
takes more than 1 minute. Admittedly 43.7 KB is not that large of an XML file so I'm inclined to believe that something else is going awry somewhere here or in one of this gem's dependencies.Here is my profiling run:
Result