nufosmatic closed this issue 5 months ago
I see - thanks for bringing this to my attention! I am not sure when I will be able to take a look but it's on my radar now, at least. I'll look into it when I am able, and am of course open to suggestions.
I've not gotten any response from ufocenter.com to my queries. And I'm not finding any way to scrape the new website.
I did not note my own website: http://nufosmatic.infinityfreeapp.com/
Hello,
I have run into the same problem.
These are the nuforc changes that I have observed:
I have made a fork and checked in changes to resolve all those problems.
https://github.com/valerioa/nuforc_sightings_data
Now everything runs smoothly and data is processed correctly.
I refactored the code to adapt to the nuforc changes. I have found a way to structure the unstructured data of the stats page. Duration is now taken from the stats page.
It should be compatible with the previous version. The only thing I changed is the formatting of the stats column.
The stats column is now pipe-delimited, in the form:
<field name>:<value>|<field name>:<value>|<field name>:<value>|<field name>:<value>|
example:
"Occurred:2021-08-19 18:00:00 Local|Location:Dallas, TX, USA|Shape:Unknown|Duration:2 minutes|No of observers:2|Reported:2021-08-20 12:49:56 Pacific|Posted:2021-08-20 00:00:00|Characteristics:Lights on object, Aura or haze around object, Aircraft nearby"
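For anyone consuming the new stats column, here is a minimal sketch of how it could be split back into fields. The function name `parse_stats` is hypothetical and not part of the fork; it assumes that values never contain a pipe and that field names never contain a colon, so it splits each part on the first colon only (timestamps keep their own colons).

```python
def parse_stats(stats: str) -> dict[str, str]:
    """Split a pipe-delimited '<field name>:<value>|...' string into a dict."""
    fields = {}
    for part in stats.split("|"):
        part = part.strip()
        if not part:
            # Tolerate a trailing pipe, as in the format description.
            continue
        # Split on the FIRST colon only, so values such as
        # "2021-08-19 18:00:00 Local" stay intact.
        name, sep, value = part.partition(":")
        if not sep:
            # No colon at all: skip rather than guess what the part means.
            continue
        fields[name.strip()] = value.strip()
    return fields


example = (
    "Occurred:2021-08-19 18:00:00 Local|Location:Dallas, TX, USA|"
    "Shape:Unknown|Duration:2 minutes|No of observers:2|"
    "Reported:2021-08-20 12:49:56 Pacific|Posted:2021-08-20 00:00:00|"
    "Characteristics:Lights on object, Aura or haze around object, Aircraft nearby"
)

fields = parse_stats(example)
print(fields["Duration"])
```

Comma-separated values such as Location and Characteristics are left as single strings here; splitting them further is a separate choice.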
@timothyrenner - please let me know if you want a pull request.
@valerioa Wow that's amazing! If you're cool with a pull request I would be happy to check it out.
Or if you'd rather maintain the fork separately that is fine too.
I have been really busy this past year and have not been able to do open source as much as I'd like, so I really appreciate the assistance here.
A pull request is probably better, so the people who cloned your repo can benefit too.
I just made the pull request. Let me know.
I do want to caution that this behavior is specifically prohibited in the ToS. That said, I want to analyze the data. Has anyone tried getting in contact with them to set up a solution?
What does the ToS say? I'm not a ToS person. If it's on the web, it's public. If you stand naked in front of an open window and I look, do not blame me for looking. They have a robots.txt and scrapy abides by robots.txt. That's enough for me.
It says that you need to ask their permission, basically. Their ToS is fairly easy to get through. I just emailed the CTO, so I'll let you all know what he says once he responds.
This is what I call the "tyranny of the ToS". We need to demystify and de-emphasize ToSes: a ToS is legally enforceable against me or you only if we signed it. I did not sign it. My browser landed on their page. So did my Python program. The rest is just hot air.
It's not about that - this is devolving into an ethics issue. I strongly believe in consent in all interactions that involve anyone. It literally took 10 seconds to find the person to email and ask "Hey, can I use your data?"
Interactions among humans are regulated by laws, and, ultimately, the Constitution. You are certainly entitled to your set of beliefs (Jacques Derrida said everyone is entitled to his own version of reality - paraphrased. Orwell said "Reality exists only in people's minds. The Party owns people's minds, thus the Party owns reality"). Please do not impose your reality on others.
Anyhow, I came here to say that the code was not working. Now it does.
@djschlicht thank you for pointing out the terms of service. When I wrote this code (7 years ago!) the site was very different, and at the time I was unable to find them or they weren't there at all. I will also reach out to the CTO and ask about it.
So, a couple of things:
This is their robots.txt - this robots.txt says crawl and download whatever you want. It would be easy for them to block every form of download.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-content/uploads/wpforms/
Sitemap: https://nuforc.org/sitemap.xml
Sitemap: https://nuforc.org/sitemap.rss
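To see what these rules say mechanically, here is a small sketch that feeds the robots.txt above to Python's standard-library `urllib.robotparser`. The helper name `allowed` and the two URLs being checked are illustrative choices, not something from the repo. This only covers robots.txt; it says nothing about the ToS question.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt quoted above, verbatim.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-content/uploads/wpforms/
Sitemap: https://nuforc.org/sitemap.xml
Sitemap: https://nuforc.org/sitemap.rss
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url: str, agent: str = "*") -> bool:
    """Return True if robots.txt permits `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


print(allowed("https://nuforc.org/"))           # site root
print(allowed("https://nuforc.org/wp-admin/"))  # WordPress admin area
print(parser.site_maps())                       # Sitemap entries (Python 3.8+)
```

One caveat: as far as I know, `urllib.robotparser` applies rules in file order rather than by longest match, so it may not honor the `Allow: /wp-admin/admin-ajax.php` exception the way some crawlers do.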
I have been able to assemble a scrapy/selenium framework to scrape the NUFORC site. I am thinking of posting the code at nufosmatic.infinityfreeapp.com after I have determined it's bullet-proof. Much of my post-processing has to be re-arranged as well.
Note that we are not the only people making use of the NUFORC data, and that NUFORC, in turn, is using MUFON data...
If you know how to get past the WordPress JavaScript to get the second and later pages of data, you might be able to recover.
nufosmatic@cox.net