plagmada / plagmada-archives


Write Web Scraping Script to Harvest Existing Gallery2 Site #23

Open truephredd opened 8 years ago

truephredd commented 8 years ago

Using Python & Beautiful Soup, create a script that crawls the existing website to gather specified metadata and image data.

truephredd commented 8 years ago

Current status: I have logic that traverses the gallery recursively, reaching each album page. Next up is to create a function that will collect the images and their associated metadata.
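A rough sketch of what a recursive traversal like this can look like. It is not the actual script: the in-memory `SITE`, the `fetch` callable, and the `example.org` URLs are assumptions for illustration, and it swaps in the standard library's `html.parser` for Beautiful Soup so it runs with no extra packages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(url, fetch, visited=None):
    """Visit url and, recursively, every not-yet-seen page it links to.

    fetch is any callable mapping a URL to HTML text (or None for a bad link).
    Returns the set of URLs that were attempted.
    """
    if visited is None:
        visited = set()
    if url in visited:
        return visited
    visited.add(url)
    html = fetch(url)
    if html is None:
        return visited  # bad link: nothing to parse, move on
    parser = LinkCollector()
    parser.feed(html)
    for href in parser.links:
        # Resolve relative links against the page they appear on.
        crawl(urljoin(url, href), fetch, visited)
    return visited


# Hypothetical three-page site held in memory, standing in for the gallery.
SITE = {
    "http://example.org/": '<a href="album1/">A</a> <a href="album2/">B</a>',
    "http://example.org/album1/": '<a href="../">up</a>',
    "http://example.org/album2/": '<a href="missing/">broken</a>',
}

pages = crawl("http://example.org/", SITE.get)
```

The `visited` set is what keeps "up" links from causing endless loops. A real crawl would presumably also restrict itself to the gallery's own domain and might replace recursion with an explicit queue for very deep albums.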

truephredd commented 7 years ago

Current status: My code now traverses the entire archive correctly. There was an issue I needed to clean up, but that's fixed (although the connection to the site isn't reliable enough for a full run to have reached the end yet).
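One common way to cope with an unreliable connection is to retry a failed request a few times before giving up. The sketch below is only an illustration of that idea, not something the script is known to do; `fetch_with_retry`, its parameters, and the `flaky` stand-in are all hypothetical names.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on OSError with a growing pause between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except OSError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(delay * attempt)


# Stand-in for a flaky connection: fails twice, then succeeds.
calls = []


def flaky(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated dropped connection")
    return "<html>ok</html>"


# sleep is replaced with a no-op so the demonstration does not actually wait.
result = fetch_with_retry(flaky, "http://example.org/", sleep=lambda s: None)
```

Catching `OSError` covers `ConnectionError` as well as `urllib.error.URLError`, which is a subclass of it.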

I've also checked out what format is optimal for import into both Omeka and Collective Access, and the answer is the same: XML. So, now I just need to start adding the code that'll do that.
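As a minimal sketch of the XML step, assuming the harvested metadata ends up as one dictionary per image: the element names (`items`, `item`, `title`, `image`, `notes`) and the sample record are made up for illustration and are not the import schema of either Omeka or Collective Access.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records):
    """Build an <items> tree with one <item> per record dict; return it as text."""
    root = ET.Element("items")
    for record in records:
        item = ET.SubElement(root, "item")
        for field, value in record.items():
            # ElementTree escapes characters such as & and < on output.
            ET.SubElement(item, field).text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample record.
xml_text = records_to_xml([
    {"title": "Box cover", "image": "cover.jpg", "notes": "Scan & photo"},
])
```

Building the tree with `ElementTree` rather than concatenating strings means special characters in titles and notes are escaped automatically.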

@kwhite2 @riverfr0zen @timhutchings

timhutchings commented 7 years ago

This sounds exciting!

In reply to Phredd Groves's comment of Sat, Nov 26, 2016 at 9:32 PM.

truephredd commented 7 years ago

Well, it's progress.

In reply to timhutchings's comment of Sun, Nov 27, 2016 at 5:24 PM.

truephredd commented 7 years ago

Checking in today just to keep my productivity tracker happy. It wants 5 minutes. Planning on using tomorrow to do some coding and put in real time.

Currently, as per Slack, I've gotten my script to the point that it can scan through the entire current archive successfully, deal with bad links, and print out metadata to the screen. My next step is to start saving images locally and then to take the metadata and save it in a format that can be imported by both our target applications (XML).

I've also laid out what's going on in the script in pseudocode (link: https://plagmada.slack.com/files/phredd/F3PRKJLJF/pseudocode.jpg), which also describes what data and metadata I'm targeting.
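The "save images locally" step, together with skipping bad links, might be sketched roughly like this. Everything here is an assumption for illustration: `save_images`, the `download` callable, and the fake URLs are hypothetical and do not come from the actual script or its pseudocode.

```python
import os
import tempfile
from urllib.parse import urlparse


def save_images(image_urls, download, dest_dir):
    """Save each image under dest_dir; return (saved_paths, bad_urls).

    download is any callable mapping a URL to bytes; it signals a bad link
    by raising OSError.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved, bad = [], []
    for url in image_urls:
        try:
            data = download(url)
        except OSError:
            bad.append(url)  # record the bad link and keep going
            continue
        # Use the last path segment of the URL as the local file name.
        name = os.path.basename(urlparse(url).path)
        path = os.path.join(dest_dir, name)
        with open(path, "wb") as fh:
            fh.write(data)
        saved.append(path)
    return saved, bad


# Stand-in downloader: one good image, one bad link.
FAKE_FILES = {"http://example.org/img/cover.jpg": b"\xff\xd8fake-jpeg"}


def fake_download(url):
    try:
        return FAKE_FILES[url]
    except KeyError:
        raise OSError("bad link: " + url)


out_dir = tempfile.mkdtemp()
saved, bad = save_images(
    ["http://example.org/img/cover.jpg", "http://example.org/img/gone.jpg"],
    fake_download,
    out_dir,
)
```

Returning the list of bad links lets a later pass retry or report them. Note that naming files by the last URL segment would overwrite images that share a file name across albums, so a real run would likely need per-album folders or unique names.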

kwhite2 commented 7 years ago

Excuse the libsplaining if you remember this but just didn't mention it for the sake of brevity:

Before uploading, there is likely going to need to be serious reformatting of the XML according to the Dublin Core schema, since most of the relevant metadata for the digital objects is contained in "notes" fields. It's going to have to be separated out and re-formatted to work well with the metadata fields in the new content management system.

Since there's no way on this green earth I'd be able to spend the time doing this on my own at the moment, I have a thought. My suggestion is that, once we're ready and the metadata is harvested from the old system, I'll sit down, take a look at it, and then we do an edit-a-thon. I'll train everyone and we can all set aside a weekend to sit down and blaze through crosswalking and reformatting the information to ingest into the new system.

Krista
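To make the crosswalking idea concrete, here is a small sketch of separating a free-text "notes" field into Dublin Core elements. It assumes, purely for illustration, that notes contain `Label: value` lines; the real notes may look nothing like this, and the `CROSSWALK` labels, the `record` wrapper element, and the sample values are invented. Only the Dublin Core namespace and element names (`title`, `creator`, `date`, `publisher`, `description`) are real.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical crosswalk: label found in the notes text -> Dublin Core element.
CROSSWALK = {"designer": "creator", "year": "date", "publisher": "publisher"}


def notes_to_dc(title, notes):
    """Split "Label: value" lines of a notes field into Dublin Core elements."""
    record = ET.Element("record")
    ET.SubElement(record, f"{{{DC_NS}}}title").text = title
    leftover = []
    for line in notes.splitlines():
        label, sep, value = line.partition(":")
        element = CROSSWALK.get(label.strip().lower())
        if sep and element:
            ET.SubElement(record, f"{{{DC_NS}}}{element}").text = value.strip()
        elif line.strip():
            leftover.append(line.strip())
    if leftover:
        # Anything unrecognised stays in dc:description for human review.
        ET.SubElement(record, f"{{{DC_NS}}}description").text = "\n".join(leftover)
    return record


# Made-up sample values.
record = notes_to_dc("Example game", "Designer: A. Person\nYear: 1983\nFirst printing")
```

Lines the crosswalk does not recognise are kept rather than dropped, which fits a workflow where people review and reformat the leftovers by hand.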


In reply to Phredd Groves's comment of Friday, January 13, 2017 at 12:39:50 PM.

truephredd commented 7 years ago

Yep! I knew I'd need help with the eventual massaging of the metadata. Sounds like I should not worry about that for now and just get things going to produce a package with our data and metadata in a format we can all look at.

I can always update the script and take another pass once we have a better idea of how to format things. I like the edit-a-thon proposal!

Cheers!

In reply to kwhite2's comment of Fri, Jan 13, 2017 at 1:38 PM.

timhutchings commented 7 years ago

Neat!

In reply to Phredd Groves's comment of Fri, Jan 13, 2017 at 11:41 AM.

kwhite2 commented 7 years ago

Yep, just keep on with the script writing. We can worry about metadata once we have some preliminary XML to look at.

Thanks for your work on this, Phredd!

Krista

In reply to timhutchings's comment of Friday, January 13, 2017 at 9:35 PM.