programminghistorian / jekyll

Jekyll-based static site for The Programming Historian
http://programminghistorian.org

Could perma.cc help PH keep weblinks sustainable? #2030

Closed hawc2 closed 3 years ago

hawc2 commented 3 years ago

I came across perma.cc today and was wondering if it could be useful for the Programming Historian to ensure its weblinks are more 'permanent.'

After chatting with @walshbr and @ZoeLeBlanc I'm opening this issue so we could do some research and see if we should be using it instead of or in conjunction with the web archive

ZoeLeBlanc commented 3 years ago

Matt Lincoln also helpfully sent this article on Robustifying Links To Combat Reference Rot https://journal.code4lib.org/articles/15509 (not tagging Matt so that we don't bug him but also wanna give him kudos).

Definitely think we should discuss this all at our next tech team meeting, which @hawc2 you're welcome to attend

acrymble commented 3 years ago

This ticket needs someone assigned to it. Otherwise it will stay open forever. @hawc2 are you planning on taking this forward?

hawc2 commented 3 years ago

Yeah, I just assigned it to myself, and the plan was @ZoeLeBlanc will bring it up at the meeting this Wednesday. There are some organizational decisions to make, but this seems like a pretty viable and sustainable option, if we can get a sponsor library on board

drjwbaker commented 3 years ago

On perma.cc, having had a look the following people are at institutions that are already partners, though I note that in some cases it may be specific (law) libraries that may not provide support to all faculty.

Signing up is free for academic libraries, so I've asked Sussex as well. My instinct is that if we want to move to perma.cc we need a number of us at institutions where our libraries have signed up. So there are two actions here:

ZoeLeBlanc commented 3 years ago

I'll reach out to Princeton, but I was planning to see about getting UIUC to join IPP anyway, so I will ask them about this too!

ZoeLeBlanc commented 3 years ago

Just to clarify @hawc2 & @drjwbaker is perma.cc free if a library sponsors us? Or does the library already need to be a member and then we just use it with their account? Mostly just wondering how much this costs the sponsoring library. Thanks!

drjwbaker commented 3 years ago

My read is that it is free for academic libraries to join, and then any faculty can use the account for any purpose. But my scan may be wrong! It may just be worth starting by asking your library about their perma.cc membership and how you can use it.

hawc2 commented 3 years ago

I'm double-checking with a colleague, and can email perma.cc, but my understanding is that a library, say Sussex Library, could become a member for free, and register PH as a journal with the library and perma.cc. Then we could add Organizational Users who work in the journal.

It seems like each of these Organizational Users wouldn't also have to work at a member library of perma.cc, but I'm double checking on that, as it is vague in the documentation. If it is true that each journal editor would need a perma.cc account through their library, I wonder if it would only be necessary for those in charge of creating the perma.cc links to get library perma.cc access, so we could keep this to a subset of our editorial team.

This page of the User Guide is helpful, as is the PDF at the bottom for academic journals: https://perma.cc/docs/libraries

I'll look into Temple Law Library, but as at a lot of academic libraries, in the U.S. at least, the Law Library is a separate organization from the rest of our Libraries in some strange ways, so it's possible I won't be able to access their account with perma.cc.

walshbr commented 3 years ago

That's funny @hawc2 - I didn't realize it was a common thing. But the UVA Law Library is separate institutionally from the rest of our Library, and I similarly would not have access to their account.

hawc2 commented 3 years ago

My ID card doesn't even get me into the Law School/Library building! Pretty sure it's the only building on campus for which that's the case. It is strange.

But maybe our libraries will be more interested in being members with perma.cc if our Law libraries already are?

drjwbaker commented 3 years ago

Yeah, I guess law libraries at US universities might be separate things, but thought I'd ping you all anyway just in case :)

hawc2 commented 3 years ago

Good news regarding perma.cc.

1) I can access perma.cc through Temple Law Library. @walshbr I'm curious what you'll find out about UVA's Law Library.

2) I heard back from perma.cc and it does sound like as long as one library, such as Sussex, is both an IPP for PH and a member of Perma.cc, then @drjwbaker could create an Org account for PH and add any of us PH editors as Org administrators for PH's perma.cc instance. Even if editors don't have access to perma.cc through their own academic libraries, they can be added as Org administrators and create perma links for PH.

It's possible we could do it with any of our libraries, and that they don't need to be Institutional Partners of PH to use perma.cc for the journal. I've followed up with perma.cc's support team to ask about long-term sustainability in terms of what happens if the relevant staff at the hosting institution were to leave either PH or their academic institution.

hawc2 commented 3 years ago

Update on migrating between institutions, from perma.cc support: "If you'd like to migrate an org from one registrar to another, you would just need to send in that request to the perma team and get permission from both the existing registrar and the intended registrar."

drjwbaker commented 3 years ago

@hawc2 Great digging! Will you reply on our behalf via Temple Law Library (ideally using programminghistorian@gmail.com, though I appreciate you probably don't have access - but you can have it)? Or would you like me to? (if I can via your library)

hawc2 commented 3 years ago

@drjwbaker Do you mean I should set up an Org account for PH through Temple's account?

If I can have access to the gmail account, I'm happy to begin a separate conversation directly with perma.cc user support about the various options we're considering for using their service for the journal.

drjwbaker commented 3 years ago

Okay. I'll email you the gmail details. If you could do it now(ish) I can be sure to approve the login when the big WARNING sign flashes up on my phone :) (google authentication has caused problems before when sharing access)

drjwbaker commented 3 years ago

@hawc2 How are you getting on with this? Need a hand?

hawc2 commented 3 years ago

Temple's Law Library has held up actually creating a Programming Historian account but I'm starting to experiment with our department's blog. I'll try to move things forward on my end and follow up. I could definitely use help evaluating how this would be best tested and used for PH. It's going to be quite time consuming

drjwbaker commented 3 years ago

Okay.

Too time consuming to be worth it? I guess what we are suggesting here is a) all future articles use perma.cc for links, and b) when link rot occurs in published articles, perma.cc is used to fix links (that is, we aren't going to go through and make perma.cc links for all published articles).

Right?

hawc2 commented 3 years ago

I got the PH account set up through Temple Law Library now. I can request that any PH editor be added; I just need names and emails. I sent a separate email to you, James, to discuss.

Agreed our goals are a) and b) first and foremost, but I'm not sure if b) is something you can do retroactively in some cases? Doing it for all published articles might be a good long term goal, but we should see about a) and b) before assessing whether anything more would be worth the time.

Do we have any sense or data on how much link rot currently affects PH tutorials?

ZoeLeBlanc commented 3 years ago

Thanks for getting this setup Alex 👏🏽 !

No set number on how often this happens, but I do think it's easily once every month or so that we find a broken link for various reasons.

I agree that focusing on future and current breaking links is the right direction and that we can over time move all lessons to using perma.cc. I think an additional next step is writing up documentation for editors to use perma.cc. Right now our tech documentation is long and not broken up easily by topic, so I would recommend potentially starting a new page for fixing broken links and we can work on archiving the existing instructions.

Let me know if you need help with this Alex and thanks again for taking the lead on this 🙌🏽

drjwbaker commented 3 years ago

Two thoughts (that came up in an email thread with @hawc2 ):

1) we should align this with @rivaquiroga's Lesson Maintenance Workflow https://github.com/programminghistorian/jekyll/issues/2058

2) maybe the correct approach here is to ask MEs for a list of active editors who need accounts? (and then add getting an account to the onboarding process) Alternatively, might editors log into perma.cc via the gmail?

drjwbaker commented 3 years ago

And thanks @ZoeLeBlanc for contributing. I'm aware that it is often @programminghistorian/technical-team members who resolve issues with broken links.

hawc2 commented 3 years ago

Update on providing access to perma.cc: all PH members can now access our perma.cc account through our programminghistorian@gmail.com account. @drjwbaker has the account access info.

Agree with @ZoeLeBlanc we should create documentation this summer for using perma.cc.

For now, we'll plan to test it out on specific broken links?

I'm happy to help lead the effort but will need some onboarding on how we're handling the problem currently. Integrating this with @rivaquiroga's Lesson Maintenance Workflow makes sense to me.

drjwbaker commented 3 years ago

I've added this to the service integrations https://github.com/programminghistorian/jekyll/wiki/Service-Integrations

acrymble commented 3 years ago

I wanted to check about perma.cc

Does it mean we no longer link to the live web but instead to archived snapshots? If so there are potential problems for links to frequently updated pages (e.g. wikipedia) that we have used to always provide the most recent knowledge, or to software download sites (again, pointing to the latest version rather than the page at time of publication).

If so then I think we need to integrate this with the copyediting stage to make sure only appropriate links are perma.cc'd and we need to update the copyediting guidance to provide clarity.
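One way to picture the distinction raised here is as a simple triage rule applied at the copyediting stage. The following is a minimal sketch only, assuming editors maintain a list of domains whose pages are meant to stay "live"; the `KEEP_LIVE_DOMAINS` set and the `should_archive` function are hypothetical names for illustration, not part of PH's tooling or of perma.cc.

```python
from urllib.parse import urlsplit

# Hypothetical list of domains whose pages we want readers to see in their
# latest form (e.g. Wikipedia, as mentioned above). Editors would extend it.
KEEP_LIVE_DOMAINS = {"wikipedia.org"}


def should_archive(url, keep_live=KEEP_LIVE_DOMAINS):
    """Return False for links that should keep pointing at the live web."""
    host = (urlsplit(url).hostname or "").lower()
    for domain in keep_live:
        # Match the domain itself and any subdomain (en.wikipedia.org, ...).
        if host == domain or host.endswith("." + domain):
            return False
    return True


print(should_archive("https://en.wikipedia.org/wiki/Link_rot"))
print(should_archive("http://ceur-ws.org/Vol-2253/paper22.pdf"))
```

A domain list is a blunt instrument: software download pages, for instance, would probably need to be flagged individually by the copyeditor rather than by domain.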

hawc2 commented 3 years ago

That's a great point @acrymble. In other words, perma.cc may help deal with "link rot," but it could pose an obstacle to PH taking advantage of "link growth"

drjwbaker commented 3 years ago

Is this making progress @hawc2? (and, do we know the steps that look like progress?)

hawc2 commented 3 years ago

Now that we have general access to use it, could we have a meeting to discuss how to proceed, both with testing and implementation?

I am still learning the ropes of some PH processes, so I'm not sure who should be involved and what are the most efficient ways to integrate perma.cc into our workflows.

It shouldn't be a hard tool to use, but as @acrymble mentioned, there are some complex decisions to consider, and it will be very time-consuming to remediate old lessons.

drjwbaker commented 3 years ago

It feels like the aim is to get it into the author/editor guidelines as our preferred implementation of URLs where we do not expect the content at those URLs to change / be usefully dynamic (as @acrymble notes). A route to implementation might be to test this with a live article submission (perhaps one you edit?) but that decision is better made by a Managing Editor than me. Perhaps we can add this as a discussion point for our next Project Team Call: @mariajoafana will this be in July?

anisa-hawes commented 3 years ago

Hello @hawc2. I'd like to be part of this conversation!

hawc2 commented 3 years ago

Per our team meeting discussion on July 28 #2159, @Anisa-ProgHist and I will test out perma.cc for a PH lesson using the one I just finished editing, currently under copyedit stage with Anisa, issue #325 in ph-submissions: https://programminghistorian.github.io/ph-submissions/lessons/clustering-with-scikit-learn-in-python.

As we finalize this lesson for publication, we'll try to develop some basic standards for use of perma.cc to deal with link 'rot' and 'growth' for further editors. We'll also track how long the process takes us.

While the copyediting stage makes most sense for integrating perma.cc, decisions still need to be made about who will do this labor regularly going forward.

anisa-hawes commented 3 years ago

As part of our pilot implementation of perma.cc on Submission #325, I have collated a list of all links which appear in the lesson (numbers represent the line/paragraph where the links appear).

This list includes links featured in tables, links referenced within code, and links in footnotes.
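A list like this could probably be collated semi-automatically. The following is a minimal sketch, assuming the lesson is available as Markdown text; the `collect_links` function and the regular expression are illustrative assumptions rather than the method actually used for this pilot, and the pattern will miss or mangle some unusual URLs.

```python
import re

# Matches http(s) URLs, stopping at whitespace and common Markdown delimiters.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def collect_links(markdown_text):
    """Return (line_number, url) pairs for every URL found in the text."""
    found = []
    for line_number, line in enumerate(markdown_text.splitlines(), start=1):
        for match in URL_PATTERN.finditer(line):
            # Trailing punctuation usually belongs to the sentence, not the URL.
            found.append((line_number, match.group().rstrip(".,;:")))
    return found


sample = """See [scikit-learn](https://scikit-learn.org/stable/) for details.

Data: http://ceur-ws.org/Vol-2253/paper22.pdf.
"""
for line_number, url in collect_links(sample):
    print(line_number, url)
```

Because it works line by line, the same pass picks up links in tables, code and footnotes, and the length of the returned list gives a quick link count per lesson.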

Note to self: I would be interested to know if the number of links included here (~75) is roughly representative of a 'standard' lesson.

LINKS

PLUS, ADDITIONAL LINKS SUGGESTED BY COPY-EDITOR @Anisa-ProgHist

anisa-hawes commented 3 years ago

Another example which I think may be useful to consider: Submission #348

LINKS

Additional links suggested by copyeditor @Anisa-ProgHist

anisa-hawes commented 3 years ago

Thinking about citations, and wondering whether it would be useful to include both the original URL and the perma.cc URL in our bibliographies/footnotes.

e.g., http://ceur-ws.org/Vol-2253/paper22.pdf archived at https://perma.cc/---

Looking at the recently published lesson Detecting Text Reuse with Passim, I notice that the citation format used doesn't expose the original URL, rather embeds it within the word 'Link'.

Greta Franzini, Maria Moritz, Marco Büchler, Marco Passarotti. Using and evaluating TRACER for an Index fontium computatus of the Summa contra Gentiles of Thomas Aquinas. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). (2018). Link

Going forward, I feel that especially when a link isn't archived at perma.cc, it is useful if we can expose original URLs (this may include considering a system for truncating excessively long URLs/those which include queries) because URLs give readers information about sources.
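As a rough illustration of that idea, here is a minimal sketch of a citation helper that exposes the original URL, shortens it for display, and appends the archived copy when one exists. The function names, the 60-character limit and the choice to drop the query string are all assumptions made for the example, not an agreed PH style.

```python
from urllib.parse import urlsplit, urlunsplit


def display_url(url, max_length=60):
    """Return a reader-friendly form of url: no query/fragment, shortened if long."""
    parts = urlsplit(url)
    shown = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    if len(shown) > max_length:
        shown = shown[: max_length - 3] + "..."
    return shown


def cite_link(url, archived_url=None):
    """Format a link for a footnote, adding the archived copy when available."""
    shown = display_url(url)
    if archived_url:
        return f"{shown} archived at {archived_url}"
    return shown


# The perma.cc identifier is left as the placeholder used above.
print(cite_link("http://ceur-ws.org/Vol-2253/paper22.pdf", "https://perma.cc/---"))
print(cite_link("https://guides.law.stanford.edu/c.php?g=588091&p=4063422"))
```

The shortened form is for display only: dropping a query string often changes which page a URL resolves to, so the full URL would still need to sit behind the visible text.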

Q: Are we still aiming to use the Chicago Manual of Style format as our template?

anisa-hawes commented 3 years ago

Also, this guide looks useful: https://guides.law.stanford.edu/c.php?g=588091&p=4063422

It shows how it is possible to 'batch create' links, and organise links within folders. Both these features will be useful to us.

PDFs can also be archived. This could be useful for an example such as that given above (http://ceur-ws.org/Vol-2253/paper22.pdf) of conference proceedings which don't have a DOI.

anisa-hawes commented 3 years ago

I suspect that Submission #348 is an unusual case, but it does raise some interesting challenges.

It included several links to interactive games which are currently playable on the live web. Perma.cc cannot effectively render this kind of complex content, so upon following the link I think readers would be dissatisfied. However, readers could choose to either click through to ‘See the Screenshot View’ to see a page that looks like the original webpage, or click through to ‘View the Live Page’ from where they will be able to get started playing the game(s) for as long as it/they exist(s) on the web.

In case anyone following this thread is interested, those instances are as follows:

Interestingly,

Links to YouTube playlists are also problematic. The page ‘looks’ right, but each individual video has a unique URL (in fact, they have multiple URLs, depending upon whether the Playlist is played through start to finish, or if a video is selected individually)

hawc2 commented 3 years ago

Thanks Anisa, this is just what we were hoping to test. The YouTube issue is more surprising than itch.io games. I can reach out to perma.cc to hear their perspective on the problem of archiving dynamic content and emulation systems. I also might consult a couple scholars who work in archiving digital games.

This article suggests webrecorder may succeed in some cases where perma.cc has not: https://blogs.bl.uk/webarchive/2019/03/archiving-interactive-fiction.html. Want to test it out?

The ability to move to the live link through perma.cc makes it still a viable option. But once we've identified any other discrepancies with specific lesson types, we should have a discussion with the managing editors about whether perma.cc's rendering of dynamic pages is too cumbersome for readers for us to use it in those cases.

anisa-hawes commented 3 years ago

Ah! Yes! I almost included in my previous comment that, when I am not at PH, I am a freelance web archivist and I use Webrecorder daily! It is my tool of choice: brilliantly powerful. It is definitely capable of capturing these interactive games; I have tested it to archive several similarly complex sites/artefacts in the past. I know the web archivists at the British Library very well, including those involved in the Collecting Interactive Digital Narratives project, and those who launched the research that became the Emerging Formats initiative. Capturing individual YouTube videos via their canonical URLs works well, and it is also possible to capture YT embeds on other websites, but Playlists pose particular challenges because of the number of URLs associated with each individual video (can be 10 or more). I would be happy to share some examples and more information.
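To make the "multiple URLs per video" problem concrete, here is a minimal sketch that reduces two common YouTube URL variants to a single canonical watch URL before archiving. The `canonical_youtube_url` function is a hypothetical helper, `VIDEO_ID` and `PLAYLIST_ID` are placeholders, and the sketch only covers `watch?v=` and `youtu.be` links, not every variant YouTube serves.

```python
from urllib.parse import parse_qs, urlsplit


def canonical_youtube_url(url):
    """Return the canonical watch URL for a single video, or None if not found."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    video_id = None
    if host == "youtu.be":
        # Short links carry the video id in the path.
        video_id = parts.path.lstrip("/") or None
    elif host == "youtube.com" or host.endswith(".youtube.com"):
        # Watch links carry it in the "v" parameter; playlist context
        # ("list", "index") is dropped.
        video_id = parse_qs(parts.query).get("v", [None])[0]
    if video_id is None:
        return None
    return f"https://www.youtube.com/watch?v={video_id}"


print(canonical_youtube_url("https://www.youtube.com/watch?v=VIDEO_ID&list=PLAYLIST_ID&index=2"))
print(canonical_youtube_url("https://youtu.be/VIDEO_ID"))
print(canonical_youtube_url("https://www.youtube.com/playlist?list=PLAYLIST_ID"))
```

A playlist page itself has no single video to resolve to, which is why the last call returns `None`; those links would still need a decision from the editor.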

anisa-hawes commented 3 years ago

The developers of Webrecorder are among my direct contacts, and I'd be delighted to chat with them about our use case ✨

drjwbaker commented 3 years ago

Per @anisa-hawes and @hawc2's introduction at https://github.com/programminghistorian/jekyll/issues/2223, given the labour involved in using perma.cc, is there a case with future new articles for a) encouraging authors to only include essential links, and b) discouraging authors from pointing to complicated links (e.g. YouTube playlists)? Both of these can be justified under our sustainability criteria https://programminghistorian.org/en/reviewer-guidelines#sustainability

hawc2 commented 3 years ago

@anisa-hawes how much additional time would you say perma.cc linking added to the copyedit stage? given that was your first time, how much faster do you think it could become?

@drjwbaker the perma.cc process definitely made apparent a number of ways we could clarify guidelines for authors/editors on when to use links and what kind. Reducing links overall isn't a bad idea, and we could ask people to avoid some kinds of unnecessary links to dynamic sites. But I think the jury is still out on our ability to preserve interactive media like games, so I think we should investigate further first.

anisa-hawes commented 3 years ago

Here is a brief summary of what I said (although did not express as clearly as I would have liked) at today's Project Team Meeting:

anisa-hawes commented 3 years ago

Per @anisa-hawes @hawc2 introduction at #2223 given the labour involved in using perma.cc is there a case with future new articles for a) encouraging authors to only include essential links, b) discouraging authors from pointing to complicated links (e.g. YouTube playlists). Both these can be justified under our sustainability criteria https://programminghistorian.org/en/reviewer-guidelines#sustainability

Yes, I think this is something we could consider... In one of the two lessons I read, I found that the author had doubled up on links multiple times, rather than defining it/providing a link upon first mention only. Elsewhere, in that lesson I found myself suggesting additional links to define technical terms. I wonder how typical these two lessons were in terms of the number of links they included?

drjwbaker commented 3 years ago

Thanks for the summary @anisa-hawes. I think..

I think this work could contribute to improving the sustainability of our lessons. During the course of collating the links to archive for the two lessons I tested this on, I identified several already broken and was able to liaise with authors to find alternatives ahead of publication.

..is ultimately the key positive. So long as we have an infrastructure where updates fail because a link on another part of the site has gone down, perma.cc has the advantage of reducing our exposure to that, thus gradually making working with the site much easier.

drjwbaker commented 3 years ago

Per @anisa-hawes @hawc2 introduction at #2223 given the labour involved in using perma.cc is there a case with future new articles for a) encouraging authors to only include essential links, b) discouraging authors from pointing to complicated links (e.g. YouTube playlists). Both these can be justified under our sustainability criteria https://programminghistorian.org/en/reviewer-guidelines#sustainability

Yes, I think this is something we could consider... In one of the two lessons I read, I found that the author had doubled up on links multiple times, rather than defining it/providing a link upon first mention only. Elsewhere, in that lesson I found myself suggesting additional links to define technical terms. I wonder how typical these two lessons were in terms of the number of links they included?

Personally, I think some authors use links in our articles as they would in a blog rather than a journal, because we've always encouraged it, and we are now seeing the downside as links break and cause work. Now, I don't want to encourage the inflexibility of journal policies towards links/URLs, but I think we could advise more parsimonious use of links and/or a use of links that is clearly justifiable/justified.

anisa-hawes commented 3 years ago

@anisa-hawes how much additional time would you say perma.cc linking added to the copyedit stage? given that was your first time, how much faster do you think it could become?

@drjwbaker the perma.cc process definitely made apparent a number of ways we could clarify guidelines for authors/editors on when to use links and what kind. Reducing links overall isn't a bad idea, and we could ask people to avoid some kinds of unnecessary links to dynamic sites. But I think the jury is still out on our ability to preserve interactive media like games, so I think we should investigate further first.

I estimate that it added another couple of hours to copyediting, but it felt worthwhile for the reasons explained above. But, you are right to observe that the process can be speeded up as I become more familiar with the workflow.

I'm not certain how often authors link out to YouTube Playlists / individual videos or exceptionally complex content (e.g. the interactive narratives), but I think it's good if we have a workflow in place for if they do – because this content isn't robust. Indeed, the author of the interactive narratives commented on their instability.

anisa-hawes commented 3 years ago

That's an interesting thought, @drjwbaker. Thank you!

anisa-hawes commented 3 years ago

In another recent Issue, we were talking about updates to the research/investigacion/recherche/pesquisa pages. I note that links on these pages break frequently. Perhaps these are good candidates for perma.cc overhauls too!

anisa-hawes commented 3 years ago

I am currently finalising a draft of revised Editorial Guidelines (to be tested in an Onboarding pilot study with the English team this autumn) which include detailed steps for the Copyediting phase of the workflow. My draft integrates step-by-step instructions for link archiving using perma.cc, but recognises that it doesn't have to be the same person who undertakes both tasks. For example, I could perform the link archiving task across all four languages.

Going forwards, I think we could consider integrating use of Webrecorder tools to stabilise (and ensure sustainable access to) the kinds of complex online content (interactives, video, 3D models, etc.) we are likely to encounter more frequently in the future. I've added this as an idea for one of our Longer-term Goals within our shared planning document.

Following this successful pilot study, I am closing this Issue.