usgpo / bulk-data

User Guides for XML on the govinfo Bulk Data Repository. For information about Bill Status XML Bulk Data, see https://github.com/usgpo/bill-status.
https://www.govinfo.gov/bulkdata

Bulk Data XML Pages Loading Very Slow #42

Closed Tom1247 closed 4 years ago

Tom1247 commented 4 years ago

We have a number of applications that rely on the content of the bulk data XML pages containing US Congressional legislation information. One app in particular normally takes ~20 minutes to run, and did so on the afternoon of Sunday, October 20. Sunday night it took ~2 hours, Monday, October 21 (yesterday) it took ~4 hours, and today it has been running for more than six hours and has not completed. Are there any known performance problems with USGPO servers? If not, can someone please share some insight into why the exact same code that has been running in 20 minutes or so is now taking 20 or more times longer to obtain your bulk data in XML? Thanks!

jonquandt commented 4 years ago

@Tom1247 -- sorry to hear you're having trouble accessing content from the bulk data repository. I'm not aware of any performance issues at the moment, but we're digging into

Can you provide a little more information?

Tom1247 commented 4 years ago

@jonquandt Thanks for the prompt attention to this matter. We pull individual items, for example https://www.govinfo.gov/bulkdata/BILLSTATUS/116/hr/BILLSTATUS-116hr5.xml, as our apps determine that our data may need to be updated. Once we have the XML, we parse it within our code (no further calls to the same govinfo.gov page) and update our database accordingly.
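The pull-and-parse pattern described above could look roughly like the following. This is a minimal sketch, not the poster's actual code: the helper names `bill_status_url` and `fetch_bill_status` are hypothetical, and the URL pattern is inferred from the single example URL in the comment.

```python
import urllib.request
import xml.etree.ElementTree as ET

BULKDATA_ROOT = "https://www.govinfo.gov/bulkdata"


def bill_status_url(congress: int, bill_type: str, number: int) -> str:
    """Build the URL of one BILLSTATUS file.

    Assumption: the path layout generalizes from the one example in the
    thread (.../BILLSTATUS/116/hr/BILLSTATUS-116hr5.xml).
    """
    return (
        f"{BULKDATA_ROOT}/BILLSTATUS/{congress}/{bill_type}/"
        f"BILLSTATUS-{congress}{bill_type}{number}.xml"
    )


def fetch_bill_status(congress: int, bill_type: str, number: int,
                      timeout: float = 30.0) -> ET.Element:
    """Download one file and return the parsed XML root element.

    A single request per item; all further work happens on the parsed tree.
    """
    url = bill_status_url(congress, bill_type, number)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return ET.fromstring(resp.read())
```

Calling `bill_status_url(116, "hr", 5)` reproduces the example URL from the comment. Setting an explicit timeout is a design choice of this sketch, so that a slow server does not stall the whole job on a single request.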

jonquandt commented 4 years ago

@Tom1247 - got it.

I see that over the weekend there appears to have been a large-scale update to the BILLSTATUS collection. Looks like there were 3482 updates/additions since the 19th, when there are usually only a few hundred on a given day.

https://api.govinfo.gov/collections/BILLSTATUS/2019-10-19T00:00:00Z?pageSize=100&offset=0&api_key=DEMO_KEY

When you've seen larger updates in the past, does that usually result in longer run times for your application?
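For reference, the collections request linked above can be assembled programmatically. The sketch below only builds the URL and uses the path and query parameters visible in that link (`pageSize`, `offset`, `api_key`); the function name `collection_updates_url` is hypothetical, and `DEMO_KEY` is the key shown in the thread, not one suited to production use.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.govinfo.gov"


def collection_updates_url(collection: str, since: str, page_size: int = 100,
                           offset: int = 0, api_key: str = "DEMO_KEY") -> str:
    """Build a collections URL listing items updated since a timestamp.

    `since` is an ISO 8601 timestamp such as "2019-10-19T00:00:00Z",
    placed in the path exactly as in the link from the thread.
    """
    query = urlencode({"pageSize": page_size, "offset": offset, "api_key": api_key})
    return f"{API_ROOT}/collections/{collection}/{since}?{query}"


url = collection_updates_url("BILLSTATUS", "2019-10-19T00:00:00Z")
```

Paging through a large update would mean increasing `offset` by `page_size` on each request. The shape of the JSON response is not shown in the thread, so it is left out here.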

Tom1247 commented 4 years ago

@jonquandt We've seen some minor slowdowns on rare occasions in the past, but nothing like this one. Maybe a ~40 minute run time for this one app instead of the typical 20 or so. Also, given that we are only pulling a single item from the repository at a time, would an update such as the one you describe affect the load times we would see for a single page?

jonquandt commented 4 years ago

@Tom1247 - I misunderstood - you're seeing slow load times for individual files, not the overall job.

When you try to access that file from your computer or another device (not from your server running the job), are you seeing the slow load times?

I'm unable to recreate slow load times when accessing from multiple devices, both internal and external to our network.

Tom1247 commented 4 years ago

@jonquandt First, I'm seeing the overall job take far, far longer than typical. On Sunday afternoon (after the large update you spoke of) we probably pulled ~6,000 bulk data XML pages and parsed them in ~20 minutes; today, that same approximate number of page requests is still running after 8 hours. If I simply drop the URL into a browser on any number of machines I'm seeing 5 to 6 seconds per page to simply render the XML from your servers - which seems quite long - and our pages per minute processing stats are showing a huge drop today.
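A quick back-of-the-envelope check shows that the figures in this comment are consistent with each other. The sketch assumes the pages are requested one after another (the thread does not say whether the job runs requests in parallel) and treats the approximate numbers quoted above as exact.

```python
def pages_per_minute(pages: int, minutes: float) -> float:
    """Throughput of a run that fetched `pages` pages in `minutes` minutes."""
    return pages / minutes


def sequential_hours(pages: int, seconds_per_page: float) -> float:
    """Total hours for `pages` requests issued strictly one at a time."""
    return pages * seconds_per_page / 3600


baseline_rate = pages_per_minute(6000, 20)    # 300.0 pages per minute
baseline_latency = 20 * 60 / 6000             # 0.2 seconds per page
slow_low = sequential_hours(6000, 5)          # about 8.3 hours
slow_high = sequential_hours(6000, 6)         # 10.0 hours
```

At 5 to 6 seconds per page, roughly 6,000 sequential requests would take between about 8.3 and 10 hours, which matches a job that is still running after 8 hours.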

Tom1247 commented 4 years ago

@jonquandt Also, FWIW, our empirical baseline is solid - same code, same server, same database, same network, same resources (which are substantial) - and we are not seeing any indication of internal issues or bottlenecks that could cause the slowdown. Thanks!

Tom1247 commented 4 years ago

@jonquandt It appears that things have returned to normal. Our apps are running at expected duration and even manual loading of individual XML pages is far, far faster - and more typical - this morning.

jonquandt commented 4 years ago

Thanks for the update, @Tom1247 - last night, we identified an issue with some requests being delayed for some users due to a configuration issue. We resolved the configuration issue and are monitoring for now while we look at additional long-term solutions.

Are you using the BILLSTATUS bulkdata sitemaps or directly crawling the pages under https://www.govinfo.gov/bulkdata/BILLSTATUS?

Tom1247 commented 4 years ago

@jonquandt We direct crawl the individual pages and things are definitely back to normal. Thanks!

jonquandt commented 4 years ago

got it -- you may also see a slight performance boost by going to the /xml endpoint for that -- it's the same content and format of the xml pages that are served at /bulkdata/BILLSTATUS, but without the stylesheets and other resources being served.

https://www.govinfo.gov/bulkdata/xml/BILLSTATUS

From some spot checking, it appears that the stripped-down XML version usually comes back in half a second, while the styled version takes 2-3 seconds.

There's also a JSON version:

https://www.govinfo.gov/bulkdata/json/BILLSTATUS

Of course, the API could give you updated items as well, filtered by Congress, but that might be a larger effort to change to than adding /xml/ to your existing request pattern.

Anyways, glad it's working for you now.
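The switch to the /xml or /json variant suggested above amounts to inserting one path segment after /bulkdata/. The helper below is a hypothetical sketch of that rewrite; the thread only confirms the pattern for the top-level BILLSTATUS listing, so applying it to deeper paths is an assumption.

```python
BULKDATA_PREFIX = "https://www.govinfo.gov/bulkdata/"


def listing_url(styled_url: str, fmt: str) -> str:
    """Rewrite a styled bulkdata listing URL to its /xml or /json variant."""
    if fmt not in ("xml", "json"):
        raise ValueError(f"unsupported format: {fmt!r}")
    if not styled_url.startswith(BULKDATA_PREFIX):
        raise ValueError(f"not a bulkdata URL: {styled_url!r}")
    # Keep everything after /bulkdata/ and put the format segment in front.
    rest = styled_url[len(BULKDATA_PREFIX):]
    return f"{BULKDATA_PREFIX}{fmt}/{rest}"


xml_listing = listing_url("https://www.govinfo.gov/bulkdata/BILLSTATUS", "xml")
json_listing = listing_url("https://www.govinfo.gov/bulkdata/BILLSTATUS", "json")
```

The two results match the /xml and /json links given in the comment.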

Tom1247 commented 4 years ago

@jonquandt Thanks for the tip but we're already using the raw XML and you're right about the load times.