Tom1247 closed this issue 4 years ago
@Tom1247 -- sorry to hear you're having trouble accessing content from the bulk data repository. I'm not aware of any performance issues at the moment, but we're digging into it.
Can you provide a little more information?
@jonquandt Thanks for the prompt attention to this matter. We pull individual items (for example, https://www.govinfo.gov/bulkdata/BILLSTATUS/116/hr/BILLSTATUS-116hr5.xml) whenever our apps determine that our data may need updating. Once we have the XML, we parse it within our code (no further calls to the same govinfo.gov page) and update our database accordingly.
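For readers following along, the fetch-then-parse pattern described above might look roughly like the sketch below. The URL layout is inferred only from the single example URL in this comment, and the names `billstatus_url`, `fetch_bytes`, and `parse_xml` are hypothetical helpers, not part of any govinfo tooling.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Base path taken from the example URL in the thread.
BASE = "https://www.govinfo.gov/bulkdata/BILLSTATUS"


def billstatus_url(congress, bill_type, number):
    """Build a per-bill URL. The pattern is an assumption inferred from
    .../BILLSTATUS/116/hr/BILLSTATUS-116hr5.xml."""
    bill_type = bill_type.lower()
    return f"{BASE}/{congress}/{bill_type}/BILLSTATUS-{congress}{bill_type}{number}.xml"


def fetch_bytes(url, timeout=30):
    """Download one document. The timeout keeps a slow server from
    stalling the whole job indefinitely."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_xml(data):
    """Parse the downloaded bytes locally; no further network calls."""
    return ET.fromstring(data)


if __name__ == "__main__":
    url = billstatus_url(116, "hr", 5)
    root = parse_xml(fetch_bytes(url))
    print(url, root.tag)
```

Keeping the download and the parse in separate functions makes it easier to see which of the two is responsible when a run slows down.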
@Tom1247 - got it.
I see that over the weekend there appears to have been a large-scale update to the BILLSTATUS collection. Looks like there were 3482 updates/additions since the 19th, when there are usually only a few hundred on a given day.
When you've seen larger updates in the past, does that usually result in longer run times for your application?
@jonquandt We've seen some minor slowdowns on rare occasions in the past, but nothing like this one. Maybe a ~40 minute run time for this one app instead of the typical 20 or so. Also, given that we are only pulling a single item from the repository at a time, would an update such as the one you describe impact the load times we would see for a single page?
@Tom1247 - I misunderstood - you're seeing slow load times for individual files, not the overall job.
When you try to access that file from your computer or another device (not from your server running the job), are you seeing the slow load times?
I'm unable to recreate slow load times when accessing from multiple devices, both internal and external to our network.
@jonquandt First, I'm seeing the overall job take far, far longer than typical. On Sunday afternoon (after the large update you spoke of) we probably pulled ~6,000 bulk data XML pages and parsed them in ~20 minutes; today, that same approximate number of page requests is still running after 8 hours. If I simply drop the URL into a browser on any number of machines I'm seeing 5 to 6 seconds per page to simply render the XML from your servers - which seems quite long - and our pages per minute processing stats are showing a huge drop today.
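The figures quoted above can be checked with a little arithmetic. This is a minimal sketch, assuming pages are requested one at a time (the thread does not say whether requests run in parallel); `time_call`, `pages_per_minute`, and `estimated_runtime_hours` are hypothetical helper names.

```python
import time


def time_call(fetch, url):
    """Run fetch(url) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fetch(url)
    return result, time.perf_counter() - start


def pages_per_minute(pages, elapsed_seconds):
    """Throughput for a finished run."""
    return pages / (elapsed_seconds / 60.0)


def estimated_runtime_hours(pages, seconds_per_page):
    """Projected wall-clock time if pages are fetched sequentially."""
    return pages * seconds_per_page / 3600.0


if __name__ == "__main__":
    # Baseline from the thread: ~6,000 pages in ~20 minutes.
    print(pages_per_minute(6000, 20 * 60))
    # Degraded case from the thread: 5 to 6 seconds per page.
    print(estimated_runtime_hours(6000, 5), estimated_runtime_hours(6000, 6))
```

Under that sequential assumption, the baseline works out to 300 pages per minute (0.2 seconds per page), while 5 to 6 seconds per page projects to roughly 8.3 to 10 hours for the same 6,000 requests, which is consistent with a job still running after 8 hours.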
@jonquandt Also, FWIW, our empirical baseline is solid - same code, same server, same database, same network, same resources (which are substantial) - and we are not seeing any indication of internal issues or bottlenecks that could cause the slowdown. Thanks!
@jonquandt It appears that things have returned to normal. Our apps are running at expected duration and even manual loading of individual XML pages is far, far faster - and more typical - this morning.
Thanks for the update, @Tom1247 - last night, we identified an issue with some requests being delayed for some users due to a configuration issue. We resolved the configuration issue and are monitoring for now while we look at additional long-term solutions.
Are you using the BILLSTATUS bulkdata sitemaps or directly crawling the pages under https://www.govinfo.gov/bulkdata/BILLSTATUS?
@jonquandt We direct crawl the individual pages and things are definitely back to normal. Thanks!
Got it -- you may also see a slight performance boost by going to the /xml endpoint for that -- it's the same content and format as the XML pages that are served at /bulkdata/BILLSTATUS, but without the stylesheets and other resources being served.
https://www.govinfo.gov/bulkdata/xml/BILLSTATUS
From some spot checking, it appears that the stripped-down XML version usually comes back in half a second, while the styled version takes 2-3 seconds.
There's a JSON version as well:
https://www.govinfo.gov/bulkdata/json/BILLSTATUS
Of course, the API could give you updated items as well, filtered by Congress, but that might be a larger effort to change to than looking at adding /xml/ to your existing request pattern.
Anyways, glad it's working for you now.
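As a rough illustration of "adding /xml/ to your existing request pattern", the sketch below inserts a format segment right after `/bulkdata`. Only the collection-level URLs shown above are confirmed by this thread; whether the same rewrite applies to deeper paths is an assumption, and `with_format_segment` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit

FORMATS = ("xml", "json")


def with_format_segment(url, fmt="xml"):
    """Return url with /xml or /json placed right after /bulkdata.

    If the URL already carries a format segment, it is replaced.
    """
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    parts = urlsplit(url)
    segments = parts.path.split("/")  # e.g. ['', 'bulkdata', 'BILLSTATUS']
    if len(segments) < 2 or segments[1] != "bulkdata":
        raise ValueError("not a /bulkdata URL")
    if len(segments) > 2 and segments[2] in FORMATS:
        segments[2] = fmt          # swap an existing format segment
    else:
        segments.insert(2, fmt)    # add the format segment
    return urlunsplit(parts._replace(path="/".join(segments)))


if __name__ == "__main__":
    styled = "https://www.govinfo.gov/bulkdata/BILLSTATUS"
    print(with_format_segment(styled, "xml"))
    print(with_format_segment(styled, "json"))
```

Rewriting the path in one place keeps the rest of a crawler unchanged, which is the smaller change compared with moving to the API.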
@jonquandt Thanks for the tip but we're already using the raw XML and you're right about the load times.
We have a number of applications that rely on the content of the bulk data XML pages containing US Congressional legislation information. One app in particular normally takes ~20 minutes to run, and did so on Sunday, October 20, in the afternoon. Sunday night it took ~2 hours, Monday, October 21 (yesterday), it took ~4 hours, and today it has been running for more than six hours and has not completed. Are there any known performance problems with USGPO servers? If not, can someone please share some insight as to why the exact same code that has been running in 20 minutes or so is now taking 20 or more times as long to obtain your bulk data in XML? Thanks!