pgh-public-meetings / city-scrapers-pitt

Pittsburgh City Scrapers: sourcing public meetings in Pittsburgh
https://pgh-public-meetings.github.io/events/
MIT License
19 stars, 66 forks

Add pa_mt_lebanon spider #200

Closed: maxachis closed this 3 years ago

ben-nathanson commented 3 years ago

For consistency, this spider should be renamed to mt_lebo_public_meetings. The name pa_mt_lebo_public_meetings lumps it in with state-wide agencies, which could be confusing.

ben-nathanson commented 3 years ago

There are 46 files associated with this, mostly related to tests. Can we get rid of any of them without affecting regression tests? If it takes more than 20 minutes, don't worry about it, but that's a lot of info.

maxachis commented 3 years ago

> There are 46 files associated with this, mostly related to tests. Can we get rid of any of them without affecting regression tests? If it takes more than 20 minutes, don't worry about it, but that's a lot of info.

After looking into this, the answer is yes, with one caveat: unless we modify the HTML file itself, we can't remove the "pa_mt_lebanon_files" directory. The HTML file references that directory 46 times. Even if the files inside the folder are missing, it functions fine as long as the folder itself exists. If the folder isn't there, though, it throws an error.

Keeping an empty folder does seem silly, but the alternative is painstakingly removing the HTML references. I'm willing to do that if necessary, especially since an empty folder might confuse people working on this later. Still, the references are so tightly wound into the thoroughly convoluted HTML that untangling them might be more trouble than it's worth.
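For anyone weighing the cleanup, a quick way to see how many references would need to be removed is to count them. This is only a sketch: the function name is hypothetical, and it assumes the saved page points at the directory through src or href attributes with an optional leading "./", which may not match the actual test file.

```python
import re


def count_directory_references(html, directory):
    """Count src/href attributes whose value starts with the given directory.

    Assumption: references look like src="dir/..." or href="./dir/...".
    """
    pattern = re.compile(
        r"""(?:src|href)\s*=\s*["'](?:\./)?""" + re.escape(directory) + "/",
        re.IGNORECASE,
    )
    return len(pattern.findall(html))


# Made-up sample markup, not taken from the real test file.
sample = (
    '<script src="./pa_mt_lebanon_files/a.js"></script>'
    '<link href="pa_mt_lebanon_files/b.css">'
    '<a href="other/c.html">link</a>'
)
print(count_directory_references(sample, "pa_mt_lebanon_files"))
```

Running the same function over the real HTML fixture would confirm whether the count of 46 still holds after any edits.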

maxachis commented 3 years ago

> When I ran scrapy crawl pa_mt_lebanon I found an IndexError (logs linked).

This was related to what I was saying about the rickety nature of the website and hence, at least previously, the spider. I modified the spider to be a bit more flexible. It now looks for keywords rather than specific ids, which are either sparse or a sequence of near-random-seeming characters.
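To illustrate the general idea (this is not the spider's actual code), here is a minimal sketch of keyword-based matching using only the standard library. The class name, the function name, and the keyword list are all assumptions made for the example. Filtering a list of text chunks returns an empty list when nothing matches, rather than raising an IndexError the way indexing into a missing element would.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text chunks of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


# Hypothetical keywords; the real spider may look for different ones.
KEYWORDS = ("meeting", "commission", "board")


def find_meeting_text(html):
    """Return text chunks that mention any keyword, ignoring element ids."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return [
        chunk
        for chunk in parser.chunks
        if any(keyword in chunk.lower() for keyword in KEYWORDS)
    ]


# Made-up markup with an opaque id, similar in spirit to the site described.
page = '<div id="x9f3a"><p>Planning Board Meeting</p><p>Contact us</p></div>'
print(find_meeting_text(page))
```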

bonfirefan commented 3 years ago

Max, great spider. I would note that you can check for month date patterns without having to spell the months out yourself. You could use datetime's %b directive, which recognizes abbreviated month names like Jan / Feb / Mar.
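A rough sketch of that suggestion follows. The regular expression and function name are assumptions for illustration, and the month matching depends on the locale (English is assumed here). The regex only finds candidates shaped like "word day, year"; strptime then decides whether the word is really a month, trying the abbreviated form (%b) first and the full name (%B) second.

```python
import re
from datetime import datetime

# Hypothetical candidate pattern: a word, a day number, an optional comma, a year.
DATE_CANDIDATE = re.compile(r"\b([A-Za-z]+)\.?\s+(\d{1,2}),?\s+(\d{4})\b")


def find_dates(text):
    """Return a datetime for each 'Mon D, YYYY' or 'Month D, YYYY' date in text."""
    dates = []
    for month, day, year in DATE_CANDIDATE.findall(text):
        for fmt in ("%b %d %Y", "%B %d %Y"):
            try:
                dates.append(datetime.strptime(f"{month} {day} {year}", fmt))
                break  # Parsed successfully, so skip the remaining format.
            except ValueError:
                # The word is not a month in this format (e.g. "Room 12, 2020").
                continue
    return dates


print(find_dates("Meeting on Jan 14, 2020 and Feb. 3, 2020"))
```

Letting strptime validate the month means no hand-written list of month names is needed, at the cost of one extra parse attempt per candidate.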