timothyrenner / nuforc_sightings_data

Data collection and processing for the National UFO Reporting Center (NUFORC) database.
MIT License
35 stars, 9 forks

Given the extreme format change after August 2023, your scripts no longer work #21

Closed nufosmatic closed 5 months ago

nufosmatic commented 7 months ago

If you know how to get past the WordPress JavaScript to fetch the second and later pages of data, you might be able to recover.

nufosmatic@cox.net

timothyrenner commented 7 months ago

I see - thanks for bringing this to my attention! I am not sure when I will be able to take a look but it's on my radar now, at least. I'll look into it when I am able, and am of course open to suggestions.

nufosmatic commented 7 months ago

I've not gotten any response from ufocenter.com to my queries. And I'm not finding any way to scrape the new website.

I did not mention my own website: http://nufosmatic.infinityfreeapp.com/

valerioa commented 5 months ago

Hello,

I have run into the same problem.

These are the NUFORC changes that I have observed:

  1. the HTML of all pages has been modified
  2. the date format in the “by posted date” page has been changed
  3. the data table has been moved in the “by posted date” page
  4. “duration” is no longer available in the date index page
  5. the table in the “stats” page was removed; the data is now free-form HTML

I have made a fork and checked in changes to resolve all those problems.

https://github.com/valerioa/nuforc_sightings_data

Now everything runs smoothly and data is processed correctly.

I refactored the code to adapt to the NUFORC changes. I have found a way to structure the unstructured data of the stats page. Duration is now taken from the stats page.

It should be compatible with the previous version. The only thing I changed is the formatting of the stats column.

The stats column is now pipe-delimited, with each field name and its value separated by a colon. This is for ease of further parsing and analysis.

<field name>:<value>|<field name>:<value>|<field name>:<value>|<field name>:<value>|

Example:

"Occurred:2021-08-19 18:00:00 Local|Location:Dallas, TX, USA|Shape:Unknown|Duration:2 minutes|No of observers:2|Reported:2021-08-20 12:49:56 Pacific|Posted:2021-08-20 00:00:00|Characteristics:Lights on object, Aura or haze around object, Aircraft nearby"
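A minimal sketch of how a stats string in this format could be parsed back into fields. The function name `parse_stats` is hypothetical and is not taken from the fork. It assumes that field names never contain a colon and that values never contain a pipe, so each item is split on its first colon only (the timestamps contain colons of their own).

```python
def parse_stats(stats: str) -> dict:
    """Split a pipe-delimited stats string into a {field name: value} dict."""
    fields = {}
    for item in stats.split("|"):
        if not item:
            continue  # tolerate an empty item, e.g. after a trailing pipe
        # Split on the first colon only: values such as timestamps
        # contain further colons that belong to the value.
        name, sep, value = item.partition(":")
        if not sep:
            raise ValueError(f"field without a colon: {item!r}")
        fields[name.strip()] = value.strip()
    return fields


example = (
    "Occurred:2021-08-19 18:00:00 Local|Location:Dallas, TX, USA|"
    "Shape:Unknown|Duration:2 minutes|No of observers:2|"
    "Reported:2021-08-20 12:49:56 Pacific|Posted:2021-08-20 00:00:00|"
    "Characteristics:Lights on object, Aura or haze around object, "
    "Aircraft nearby"
)

stats = parse_stats(example)
print(stats["Occurred"])
print(stats["Duration"])
```

If a value could ever contain a pipe character, this simple split would no longer be safe and the column would need some form of escaping.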

@timothyrenner - please let me know if you want a pull request.

timothyrenner commented 5 months ago

@valerioa Wow that's amazing! If you're cool with a pull request I would be happy to check it out.

Or if you'd rather maintain the fork separately that is fine too.

I have been really busy this past year and have not been able to do open source as much as I'd like, so I really appreciate the assistance here.

valerioa commented 5 months ago

A pull request is probably better, so that the people who cloned your repo can benefit too.

I just made the pull request. Let me know.

djschlicht commented 5 months ago

I do want to caution that this behavior is specifically prohibited in the ToS. That said, I want to analyze the data. Has anyone tried getting in contact with them to set up a solution?

valerioa commented 5 months ago

What does the ToS say? I'm not a ToS person. If it's on the web, it's public. If you stand naked in front of an open window and I look, do not blame me for looking. They have a robots.txt and Scrapy abides by robots.txt. That's enough for me.

djschlicht commented 5 months ago

It says that you need to ask their permission, basically. Their ToS is fairly easy to get through. I just emailed the CTO, so I'll let you all know what he says once he responds.

https://nuforc.org/terms/

valerioa commented 5 months ago

This is what I call the "tyranny of the ToS". We need to demystify and de-emphasize ToSes: a ToS is legally enforceable against me or you only if we signed it. I did not sign it. My browser landed on their page. So did my Python program. The rest is just hot air.

djschlicht commented 5 months ago

It's not about that - this is devolving into an ethics issue. I strongly believe in consent in all interactions that involve anyone. It literally took 10 seconds to find the person to email and ask "Hey, can I use your data?"

valerioa commented 5 months ago

Interactions among humans are regulated by laws and, ultimately, the Constitution. You are certainly entitled to your set of beliefs. (Jacques Derrida said, paraphrased, that everyone is entitled to his own version of reality. Orwell said, "Reality exists only in people's minds. The Party owns people's minds, thus the Party owns reality.") Please do not impose your reality on others.

Anyhow, I came here to say that the code was not working. Now it does.

timothyrenner commented 5 months ago

@djschlicht thank you for pointing out the terms of service. When I wrote this code (7 years ago!) the site was very different, and at the time I was unable to find them or they weren't there at all. I will also reach out to the CTO and ask about it.

So, a couple of things:

  1. The ToS covers usage of the code, not the code itself. That is an irritating and pedantic distinction, but it is worth calling out because it means that, with respect to the present discussion, the ToS isn't technically relevant. The code is the code, not its usage. Regarding the dataset on data.world and elsewhere, that is a very different discussion, but not the discussion here.
  2. Thus far, I have not been contacted by anyone from NUFORC. I try to be a good actor when it comes to obtaining datasets, and if I were asked to stop hosting the dataset or scraping the website, I would. That has not been the case.

valerioa commented 5 months ago

This is their robots.txt - this robots.txt says crawl and download whatever you want. It would be easy for them to block every form of download.

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-content/uploads/wpforms/

Sitemap: https://nuforc.org/sitemap.xml
Sitemap: https://nuforc.org/sitemap.rss
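
As a rough illustration, the rules quoted above can be checked against specific URLs with Python's standard `urllib.robotparser`. This is only a sketch: the helper name `allowed` is hypothetical, the rules are pasted in as a string rather than fetched from the site, and the result says nothing about the terms of service, only about what this robots.txt disallows.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content quoted in the comment above, pasted in verbatim.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-content/uploads/wpforms/

Sitemap: https://nuforc.org/sitemap.xml
Sitemap: https://nuforc.org/sitemap.rss
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str, user_agent: str = "*") -> bool:
    """Return True if the quoted rules let `user_agent` fetch `url`."""
    return parser.can_fetch(user_agent, url)


# The Allow line for admin-ajax.php is not checked here, because parsers
# differ in how they resolve overlapping Allow/Disallow rules.
print(allowed("https://nuforc.org/"))
print(allowed("https://nuforc.org/wp-admin/"))
print(allowed("https://nuforc.org/wp-content/uploads/wpforms/"))
```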

nufosmatic commented 4 months ago

I have been able to assemble a Scrapy/Selenium framework to scrape the NUFORC site. I am thinking of posting the code at nufosmatic.infinityfreeapp.com after I have determined it's bulletproof. Much of my post-processing has to be rearranged as well.

Note that we are not the only people making use of the NUFORC data, and that NUFORC, in turn, is using MUFON data...