openzim / zim-requests

Want a new ZIM file? Propose ZIM content improvements or fixes? Here you are!
https://farm.openzim.org

New request: marc.info #1080

Open vitaly-zdanevich opened 3 months ago

vitaly-zdanevich commented 3 months ago
Popolechien commented 3 months ago

Hey @vitaly-zdanevich this looks like an email archive. What would be the use case here? (as in, what could someone possibly do with these, except nurture some form of nostalgia? is there an educational angle?)

vitaly-zdanevich commented 3 months ago

> is there an educational angle?

Yes - reading about some computer problem and how to fix it.

RavanJAltaie commented 2 months ago

`First of all, we respect the 'x-no-archive' mail header ala DejaNews -- if your mail message includes a 'x-no-archive: yes' header it will be dropped from our feed, to respect the wishes of list members who wish to keep their posts private. Since some mail clients still do not allow adding arbitrary mail headers, we've recently adopted what Deja (now Google) added when we weren't looking: we also check for 'X-No-Archive: yes' in the first line of the body of mails. Please email us if you find a message in our database posted by you with the 'x-no-archive' header set. (Note to list admins: the ezmlm list management software gratuitously adds 'X-No-Archive: yes' to every mail that passes through the list by default. It's not possible to tell if the mail was originally sent with that header set or not. That'll prevent any ezmlm'd message from appearing in MARC, until/unless the default list config is changed by an administrator.)

Second, any database view that includes multiple messages (viewing threads, browsing lists, or browsing the results of a search) will show only the real name of the sender, or, failing that, the username of the sender with the @domain.com stripped. So, index pages cannot be pulled and parsed by address-trolling robots. Such a robot would have to pull every individual message to obtain a list of addresses.

More recently, other archives have started stripping the names and addresses off of posts entirely. That starts to get into some sticky copyright issues--by removing attribution entirely you are no longer giving credit where credit is due--so it's a difficult balance. That's an easy step for list-admins to take themselves--they "own" the contents of their lists, and list-members agree to whatever their terms & conditions are by using their list--but since we're a third party and don't have the explicit or implicit permission of each poster to remove their attributions, we can't do that. However, I have implemented some address munging which obfuscates, but does not destroy or remove, the original poster's adress--at signs and periods are replaced with ' () ' and ' ! ' respectively. Of course most spammers will evolve to handle these eventually... We still do not tamper with message bodies, however, since that might break patches, PGP signatures, etc.

Third, we only archive and make available mailing lists whose contents are already public, such as available at at least one other site on the net, or which have an open policy. Most of our archives are totally unofficial, and as such, when in doubt, we first verify that the list maintainers have allowed at least one other site to carry archives of messages before we make them available to the rest of the world, to respect the wishes of list administrators who wish to keep their lists private. List administrators are invited to email us requests to cease archiving their list and/or making the archive publically accessable if they wish for any reason. We also welcome submissions of "blurbs" to go along with a particular list's archive describing the topic of the list, its home, maintainers, etc (something we're not very good about keeping up to date for all the lists, left to our own devices).

We are very, very, very reluctant to make any changes to database-contents once a message comes in. We've received threats from clueless companies' lawyers because of archived bugtraq posts pointing out security flaws, for example. If we honor occasional "oops I didn't mean to post that" mails, we would be editing content, and those clueless lawyers might have a leg to stand on. As a result, our position is that we will only remove a message for one of these reasons:

- A list admin asks us to remove a private list we've accidentally made public archives for, in which case, poof, the whole list is gone (after we are sufficiently sure it's really the list admin requesting it, and not a forged mail, etc).
- On request from an original poster (that we can come close to verifying, given that on the Internet everyone is a dog) that is agreed-to by the list admin/owner.
- A message is clearly, without a doubt, useless spam that got through a list's (or our) filters.
- A court orders us to remove a message.`
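To make two of the rules quoted above concrete, here is a minimal Python sketch of the 'x-no-archive' check (header, or first line of the body) and of the address munging (at signs and periods replaced with ' () ' and ' ! '). The function names `opts_out_of_archive` and `munge_address` are hypothetical and this is not MARC's actual code; it only illustrates the behaviour as described, and it assumes a plain, non-multipart message.

```python
import re
from email import message_from_string

# Matches a body line such as "X-No-Archive: yes", ignoring case and spacing.
_NO_ARCHIVE_LINE = re.compile(r"^\s*x-no-archive:\s*yes\s*$", re.IGNORECASE)


def opts_out_of_archive(raw_message: str) -> bool:
    """Return True if the message asks not to be archived (hypothetical helper)."""
    msg = message_from_string(raw_message)

    # Header lookup in the email module is case-insensitive.
    header = msg.get("X-No-Archive", "")
    if header.strip().lower() == "yes":
        return True

    # Fallback described in the quote: check the first line of the body.
    body = msg.get_payload()
    if isinstance(body, str):  # assumption: multipart messages are not handled
        lines = body.splitlines()
        first_line = lines[0] if lines else ""
        return bool(_NO_ARCHIVE_LINE.match(first_line))
    return False


def munge_address(address: str) -> str:
    """Obfuscate an address without destroying it (hypothetical helper)."""
    return address.replace("@", " () ").replace(".", " ! ")
```

For example, `munge_address("user@example.com")` gives `user () example ! com`, which a human can still read back but a naive address harvester would not match.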

I think this could raise many problems, since the content has not been fully checked for potential copyright issues (the check is still in progress).

@Popolechien what do you think?

Popolechien commented 2 months ago

Looks good to me. As they say, the content is freely available and they remove whatever they are asked to remove. We are just a glorified mirror at this stage.

RavanJAltaie commented 1 month ago

Recipe created: https://farm.openzim.org/recipes/marc.info_en_all. I'll update the library link once it is ready.

benoit74 commented 1 month ago

> As of 2014-04-02, the MARC archive has 70 million emails

This means 70M pages to scrape with zimit; that is not feasible, at least in its current state. As a rule of thumb, 1M pages already takes a lot of time.

> If you want to crawl MARC, please be sure you have a delay between requests, say one or two seconds.

I don't see such a delay in the recipe.

For both reasons, I've disabled the recipe and cancelled the current task.
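To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It assumes a single sequential crawler, counts only the requested delay between requests (fetch and processing time are ignored, so it is a lower bound), and uses the 70 million figure from 2014. The helper name `crawl_duration` is hypothetical.

```python
from datetime import timedelta


def crawl_duration(pages: int, delay_seconds: float) -> timedelta:
    """Lower bound on crawl time: one delay per page, nothing else counted."""
    return timedelta(seconds=pages * delay_seconds)


PAGES = 70_000_000  # figure quoted from MARC, as of 2014-04-02

for delay in (1, 2):
    duration = crawl_duration(PAGES, delay)
    years = duration.days / 365
    print(f"{delay}s delay: about {duration.days} days (~{years:.1f} years)")
```

Even at the lower end of the requested delay, the waiting time alone comes to more than two years.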

RavanJAltaie commented 1 month ago

I'll tag the request as "need scraper" for now.