ropensci-archive / bomrang

:warning: ARCHIVED :warning: Australian government Bureau of Meteorology (BOM) data client for R

BOM denying any HTTP requests #138

Closed adamhsparks closed 1 year ago

adamhsparks commented 3 years ago

BOM is now denying any HTTP requests made using a method other than a web-browser. http://reg.bom.gov.au/weather-services/announcements/

Website notification of change
Scheduled Release Date: 3 March 2021

A web application firewall policy has been implemented for www.bom.gov.au which will block screen scraping activity. The Bureau is monitoring screen scraping activity on the site and will commence interrupting, and eventually blocking, this activity on www.bom.gov.au from Wednesday, 3 March 2021. This is aimed at protecting infrastructure, system access and security, intellectual property and server/service load. Web or screen scraping is the act of copying information that shows on a digital display so it can be used for another purpose. This activity has always been at odds with the Bureau's terms and conditions. We understand www.bom.gov.au contributes significantly to the work of many individuals and organisations and we are committed to continuing to provide access through our registered user’s channel. For further information, or to discuss the ongoing use of our materials, please make contact with us via weatherquestions@bom.gov.au.

This directly affects:

These functions are effectively no longer functioning because BOM is denying access. Note that the first two appeared here: http://www.bom.gov.au/catalogue/data-feeds.shtml, so to me using them didn't count as scraping as defined above, since one is a .zip file and the other is .json (what else would you be doing with a .json file?). However, to be fair, the bulletins do appear to clearly fall into that category.

@HughParsonage, @jonocarroll, @mpadge, you're the authors of these functions. How would you like to proceed?

I don't consider spoofing the User Agent to be a valid response here either. If someone knows someone at BOM that we can talk to, that's one option that I see. The other is to simply remove the functionality since BOM seems to be completely against it.

I can fill out the form http://reg.bom.gov.au/screenscraper/screenscraper_enquiry_form/ and see if I even get a response, but I never have any time I've tried before for other reasons.

adamhsparks commented 3 years ago

I have also received the obligatory flame from CRAN about bomrang failing and needing to be reminded of the policy about failing gracefully (as if I forgot). So this needs to be resolved by 28/04/2021 to keep bomrang on CRAN.
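The CRAN policy referred to here asks packages that use Internet resources to fail gracefully with an informative message when the resource is unavailable. A minimal sketch of that pattern follows, written in Python purely for illustration (bomrang itself is an R package). The helper name `fetch_or_none` and the injectable `opener` argument are assumptions made for this example and are not part of bomrang.

```python
import urllib.error
import urllib.request


def fetch_or_none(url, opener=urllib.request.urlopen, timeout=10):
    """Return the response body as bytes, or None if the resource is unavailable.

    Hypothetical helper: rather than letting a network error propagate to the
    caller, it prints an informative message and returns None.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered but refused or failed the request (e.g. 403).
        print(f"Could not retrieve {url}: HTTP {err.code}")
    except OSError as err:
        # No usable answer at all (DNS failure, refused connection, timeout).
        print(f"Could not reach {url}: {err}")
    return None
```

Callers then check for `None` and stop early with a message instead of raising, which is the behaviour the policy asks for when a remote source goes away.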

mpadge commented 3 years ago

I'm happy to formulate an email to the totally generic address they put there (weatherquestions@bom - really?). Their statements are really very obstructive:

Web or screen scraping is the act of copying information that shows on a digital display so it can be used for another purpose. This activity has always been at odds with the Bureau's terms and conditions.

It nevertheless seems that a lot of stuff is still available via FTP, as described here, but the "full list of all products available via anonymous FTP" leads to a bunch of pages that are completely unstructured and effectively useless. The observation bulletins are, for example, there, but the names are encoded according to a completely opaque system which we would have to reverse engineer. Once done, however, the files on the FTP server are the same HTML files used to generate the bulletin tables. That should still all be possible and allowed, but would require an extra layer of complexity to figure out the nomenclature of their FTP naming system.

The historical data are, however, not available via FTP, so no idea how to circumvent that. For reference, here is the Information Publication Scheme, and their Freedom of Information page.

HughParsonage commented 3 years ago

I do not understand the purpose of the .json files other than for this sort of access. Are they seriously suggesting that the intended means of accessing the data is to click on the link and copy it manually?

mpadge commented 3 years ago

Yes @HughParsonage they are seriously suggesting that should be the only method for anybody other than direct employees of da BoM. The problem is they have no system for authorized requests, which would solve everything, and in lieu of that some brains trust obviously decided excluding everybody is safer and easier.

adamhsparks commented 3 years ago

So, what's the course of action here with these three functions? They aren't going to work any longer unless BOM does an about-face, but the wording,

This activity has always been at odds with the Bureau's terms and conditions.

and

We understand www.bom.gov.au contributes significantly to the work of many individuals and organisations and we are committed to continuing to provide access through our registered user’s channel.

tells me that they know they've ruined it for everyone and don't give a stuff and a change isn't likely to happen.

I need to get the package fixed. Ripley has already flamed me for this: https://www.stats.ox.ac.uk/pub/bdr/donttest/bomrang.out

jonocarroll commented 3 years ago

If BOM don't want to support our use-case then it would be really good if they provided some sort of queryable API. The FTP options don't seem to have station-level data, at least not in a parseable way.

One alternative would be to create a data dump (either an official export or we run the queries while we still can) and somehow update that occasionally. It's all up in the air for the fetching functions so I'd say temporarily redirect those to errors, turn off all the tests for them, and resubmit to CRAN. This still leaves all the existing versions which are running and will continue to fail, so some announcements in the README and Twitter would be useful.

I'm keen to hear if anyone at BOM actually wants to see people using the data.
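The suggestion above to temporarily redirect the fetching functions to errors can be sketched as a stub that keeps the public function name but always raises an informative error. This is an illustration in Python only, not bomrang's R code; `DataSourceUnavailableError`, `disabled`, and the Python stand-in for `get_historical_weather` are hypothetical names introduced for the example.

```python
class DataSourceUnavailableError(RuntimeError):
    """Raised when a function's upstream data source is no longer accessible."""


def disabled(name, reason):
    """Build a stub that accepts any arguments and always raises."""

    def stub(*args, **kwargs):
        raise DataSourceUnavailableError(
            f"{name}() is temporarily disabled: {reason}"
        )

    stub.__name__ = name
    return stub


# Hypothetical stand-in for one of the affected fetching functions.
get_historical_weather = disabled(
    "get_historical_weather",
    "BOM now blocks the HTTP requests it relied on.",
)
```

Existing scripts that call the function then fail immediately with an explanation rather than with an opaque HTTP error, and the stub can be swapped back for a real implementation later without changing the public interface.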

adamhsparks commented 3 years ago

I don't see how one can do a data dump given the current policies prohibit this.

I don't see what's up in the air here about fetching functions? If it uses HTTP as these three do, it's DOA with BOM's stated and enforced policies. For now, the FTP ones seem safe, but I'm not holding my breath.

jonocarroll commented 3 years ago

It wasn't a 'good citizen' suggestion, but if the files are available on the website then one could loop over those (with the right HTTPUserAgent) and grab the historical data*.

The 'up in the air' part referred to whether or not we could make a working wrapper for the FTP fetches but I'm not sure if a) it would even work, b) it would be stable, c) it would remain accessible, d) we could do the station alignment. I vote for (at least temporarily) disabling the failing functions and perhaps opening a line of communication with BOM.

*not an endorsement of said activity.

vam103 commented 3 years ago

In terms of the FTP there does seem to be a folder that has climate data over a number of years with a database of stations here ftp://ftp.bom.gov.au/anon/gen/clim_data/IDCKWCDEA0/

It seems to be fairly historical from 2009... but it's probably still a pib to actually get it to work.
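As a starting point for exploring that folder, here is a minimal sketch of listing an anonymous FTP directory with Python's standard `ftplib`. The host and path are taken from the URL above; the function name `list_ftp_directory` and the injectable `ftp_factory` argument are assumptions for this example, and nothing here interprets the file names, which would still need to be worked out.

```python
import ftplib


def list_ftp_directory(host, path, ftp_factory=ftplib.FTP, timeout=30):
    """Return the sorted names found in `path` on an anonymous FTP server."""
    ftp = ftp_factory(host, timeout=timeout)
    try:
        ftp.login()  # anonymous login when no credentials are given
        ftp.cwd(path)
        return sorted(ftp.nlst())
    finally:
        # Close the connection even if login or listing fails.
        ftp.close()
```

A call such as `list_ftp_directory("ftp.bom.gov.au", "/anon/gen/clim_data/IDCKWCDEA0/")` would attempt the listing, but whether the folder stays available is not guaranteed.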

adamhsparks commented 3 years ago

I'd be happy for these three to be re-implemented using FTP if possible, but I've got two rather important items going at work right now that take precedence over this.

So, if someone wants to contact BOM, that would be grand. I just don't have the time right now or the inclination based on previous attempts with no response.

So, I need someone who's willing to:

HughParsonage commented 3 years ago

As upsetting as it might be, the correct course of action, at least in the meantime, may be to ask CRAN to archive the package. The purpose of this package is to download data from a source that is no longer available. It's a shame, but not ours.

mpadge commented 3 years ago

I can fix the bulletin function to work in current form by switching to FTP, and should be able to find some time to do that next week. But that's only a very partial solution, and it's the historical data i worry about, along with @jonocarroll's concern about stability of provision of FTP data, especially since the front page there clearly states:

The Bureau does not guarantee the availability of information on the FTP site

How about i volunteer to:

  1. Fix get_weather_bulletin
  2. Formulate an email to the assuredly useful weatherquestions email and CC y'all in.

Can we list details of all other fail points, in terms of (i) current call and http endpoint; and (ii) equivalent data in the crappy FTP list. In the meantime, two further suggestions from my side would be:

  1. Someone with admin to this repo switch on discussions so we can use that to further opine, complain, and plan;
  2. Somebody volunteer themselves to quickly switch off all current CRAN fails and get a safer version back on.

(Finally, getting archived is no real biggie - i've had it happen a couple of times, but getting back on again was always very straightforward. bomrang is not used by any other packages, so nothing will break through it being archived for a short while.)

And if i have time beyond that, i'll make dprkweather just to provide a comparative demonstration that that's a darn sight easier than Australia!! Dammit!

jonocarroll commented 3 years ago

I can try to find some time over the weekend to write a PR to (temporarily) disable get_historical_weather().

adamhsparks commented 3 years ago

@HughParsonage, not all of the functions are affected, only three are affected by this currently, so CRAN archiving seems excessive in this case. But it's an option until something is sorted I suppose. And I do agree with you, it's not on us if it is archived.

adamhsparks commented 3 years ago

@mpadge, discussions are now open

TristanLouthRobins commented 3 years ago

Yes @HughParsonage they are seriously suggesting that should be the only method for anybody other than direct employees of da BoM. The problem is they have no system for authorized requests, which would solve everything, and in lieu of that some brains trust obviously decided excluding everybody is safer and easier.

As someone who has used bomrang repeatedly as test data for honing my wrangling and analysis skills, this news is enormously disappointing. For me, the more manual workaround would be fine if BOM incorporated an updated means of downloading datasets (as .csv etc) but their pages are so horrendously dated and restricted to pdfs and manual copy/pastes. Additionally historical data sets presented on these pages are frequently days or weeks behind the current date.

Appreciate the work you're all putting into hopefully resolving this or providing alternatives.

gutzbenj commented 3 years ago

Hello everyone, we may have a solution for the problem you are facing here. The BOM service is based on a database structure developed by the Kisters company. The interface can generally be accessed by kiwis_pie [1], which attempts to parse internal kiwis (that's the official interface name) methods and provides them as class methods, which one can then easily call. @amotl has set up a demo [2] at our repository "wetterdienst" [3], where we try to streamline access procedures and implement new services such as BOM.

If you'd like to give it a shot and are fine with programming in Python this may be a feasible approach for getting this data in relatively short time.

Cheers, Benjamin

[1] https://github.com/amacd31/kiwis_pie
[2] https://github.com/earthobservations/wetterdienst/blob/collab/bom/wetterdienst/provider/bom/demo.py
[3] https://github.com/earthobservations/wetterdienst

mpadge commented 3 years ago

Thanks @gutzbenj, that would be a great idea, but unfortunately the BoM uses a Kisters database only for the water data. This bomrang package is about the weather data, which use an entirely different, BoM-internal system which they have now effectively prohibited external access to. What we were accessing is this front page, from which we used to be able to access full data like these. You'll see there that it's an entirely different kind of system to the water data, and one to which open and public access is now restricted or effectively prohibited. Alas.

gutzbenj commented 3 years ago

If I understood @amotl correctly, he said that although the KIWIS interface formally provides only water data, it also allows the user to access weather data (which is probably stored in the same database). I will check this later on and see what is available from there.

adamhsparks commented 3 years ago

While I’m all for using the right language for the right application, I’m not sure that this is the best place to use Python. I have used Python in R to fetch data, but that was just in a script, not a full-blown package, and it worked very well. If a user wanted to use Python to access the data in R, this could work, but I’m reluctant to integrate Python into an R package. Managing one language in a package is often enough work without wrangling an API through a second language. I’m not dismissing this outright, but if there’s a possibility to access the API using R, then I’m listening more intently.

mpadge commented 3 years ago

No worries @adamhsparks - I've had a look at the code, and it'll all be easy to port over to pure R if this turns out to be a workable approach.

buzacott commented 3 years ago

My little R package bomWater was accessing this API to retrieve BoM Water data. When I was poking around I didn't see any indication that it served anything else. There is some weather data, but it is only from whatever the hydrological stations are providing (i.e. not BoM weather stations). The water page is just aggregating the data from all the different state water services that can also be accessed using a similar API, e.g. https://realtimedata.waternsw.com.au and https://data.water.vic.gov.au/, which makes me think that data collected by the BoM won't be available.

In case you can access other data, the code here may help design API queries in R: https://github.com/buzacott/bomWater/blob/master/R/bomWater.R

edit: I was going to note that access to the API was blocked before when the BoM cut off access to everyone, but it seems to be working now. I don't know if this will continue to be the case.

CMurtagh-LGTM commented 2 years ago

There's a python library weather-au, which has managed to reverse engineer the api of the bom's new weather website. It gets json data from https://api.weather.bom.gov.au/v1.

jonocarroll commented 2 years ago

Neat, @CMurtagh-LGTM - I suspect we could use the same API for current observations, but we're still stuck for historical data.

trickypr commented 2 years ago

I am not sure if this has been mentioned yet, but the firewall uses the user agent. If you set the user agent string to be that of chrome, BOM will return the data.

mpadge commented 2 years ago

Thanks @trickypr, that may indeed work, but we'd still be violating the BoM's terms of usage which forbid "web scraping". My guess is we'll have to wait for some kind of appeal to legalise web scraping before they budge. In the meantime not much we can legally do here, short of resorting to their intentionally obfuscatory FTP server.

adamhsparks commented 2 years ago

Hi @trickypr, we're well aware. But @mpadge is correct, while it may be technically possible, many of us involved in the package's development and use are not in positions to circumvent BOM's opinion on this and the FTP site only offers incomplete data vs the http methods.