relaton / relaton-un

UnBib: retrieve UN documents for bibliographic use using the BibliographicItem model
MIT License

Implement relaton-un for UN documents #1

Closed ronaldtse closed 4 years ago

ronaldtse commented 5 years ago

This is the site: https://documents.un.org/

The way to search is by the "UN Symbol", which is the unique UN document number. Before circulation, every UN document must be registered in the UN ODS (Official Document System) and assigned a unique "Symbol".

For example, the document symbol "TRADE/CEFACT/2004/32" leads to this page:

[Screenshot: ODS page for document symbol TRADE/CEFACT/2004/32]

The metadata available are:

ronaldtse commented 5 years ago

P.S. @opoudjis this is for integration into metanorma-unece (we might want to change it to metanorma-un and support the "organization" as "unece")

andrew2net commented 5 years ago

Hello @ronaldtse, I ran into a problem. It looks like the site https://documents.un.org is protected against scraping. When we open the page https://documents.un.org/prod/ods.nsf/home.xsp, it sets a cookie, redirects to another page with a password in the parameters, and then redirects back to the original page. I've tried to reproduce the browser's behavior, but the site redirects me to a login page. I have already spent a lot of time on this and now have to do some other tasks. I can return to this work later.
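The redirect chain described above (set a cookie, redirect away, redirect back) can be sketched in Ruby as follows. This is only an illustration of carrying cookies across redirects. The thread does not record how the problem was eventually solved, and this sketch alone may well still end at the login page. The names `merge_cookies`, `follow_with_cookies`, and `NET_HTTP_FETCH` are hypothetical and not part of relaton-un.

```ruby
require "net/http"
require "uri"

# Merge the name=value part of each Set-Cookie line into a copy of the jar.
# Attributes such as Path or Expires are ignored in this sketch.
def merge_cookies(jar, set_cookie_lines)
  Array(set_cookie_lines).each_with_object(jar.dup) do |line, acc|
    name, value = line.split(";", 2).first.split("=", 2)
    acc[name.strip] = value.to_s.strip
  end
end

# Follow redirects while resending every cookie collected so far.
# `fetch` is any callable taking (url, cookie_header) and returning a hash
# with :location (nil when the response is not a redirect),
# :set_cookie (array of header lines) and :body.
def follow_with_cookies(url, fetch, limit: 10)
  jar = {}
  limit.times do
    cookie_header = jar.map { |k, v| "#{k}=#{v}" }.join("; ")
    resp = fetch.call(url, cookie_header)
    jar = merge_cookies(jar, resp[:set_cookie])
    return resp if resp[:location].nil?

    # Resolve relative Location headers against the current URL.
    url = URI.join(url, resp[:location]).to_s
  end
  raise "too many redirects"
end

# One possible `fetch` built on Net::HTTP (defined here, not called).
NET_HTTP_FETCH = lambda do |url, cookie_header|
  uri = URI(url)
  req = Net::HTTP::Get.new(uri)
  req["Cookie"] = cookie_header unless cookie_header.empty?
  res = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(req)
  end
  {
    location: res.is_a?(Net::HTTPRedirection) ? res["location"] : nil,
    set_cookie: res.get_fields("set-cookie") || [],
    body: res.body,
  }
end
```

Passing the fetcher in as an argument keeps the redirect logic testable without network access; a real run would be `follow_with_cookies(url, NET_HTTP_FETCH)`.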

ronaldtse commented 5 years ago

@andrew2net have you tried

?

andrew2net commented 5 years ago

@ronaldtse the problem isn't the scraping itself. I can get the page using a script. The site somehow detects whether the client is a browser or a script, and redirects the script to the login page. I tried to make the script act like a browser and reproduce everything the browser does. Maybe I missed something. I'll check it later.

ronaldtse commented 5 years ago

@andrew2net try the Mechanize gem, which pretends to be a browser. The Kimurai gem goes further by allowing execution of JS within the fake browser. Either approach may be able to get around this issue.

andrew2net commented 5 years ago

@ronaldtse ok, I'll try it

andrew2net commented 5 years ago

@ronaldtse the Mechanize gem doesn't solve the problem. The Kimurai gem requires Ruby version >= 2.5.0. I haven't tried it yet.

andrew2net commented 5 years ago

@ronaldtse I solved the problem and am continuing to implement the gem.

ronaldtse commented 5 years ago

Thank you @andrew2net !

andrew2net commented 4 years ago

@ronaldtse please help me map the data from the page to the Relaton model:

ronaldtse commented 4 years ago

The mappings are good. For the remaining, should we put them as new fields in bibdata?

  • Session or Year: 10 => ?
  • Agenda Item(s): 12 => ?
  • Distribution: GEN => ?
  • Area: UNDOC => ?

Ping @opoudjis

opoudjis commented 4 years ago

Three of those fields are already modelled in bibdata/ext:

session = bibdata/ext/session/number
agenda item = bibdata/ext/session/item-number
distribution = bibdata/ext/distribution

opoudjis commented 4 years ago

Btw, I check notifications once a month, and I monitor my project, which is Metanorma. @ronaldtse and @andrew2net, I reiterate: if you want to draw my attention to any Relaton tasks, you MUST communicate with me on Skype.

opoudjis commented 4 years ago

The "area" is brand new to me, and I've yet to find a document whose area is not UNDOC. It does not appear on UN documents either. I'd be inclined to put it in the bibdata/classification field, as <classification type="area">UNDOC</classification>.
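For reference, the field mapping proposed so far in this thread can be written down as a lookup table. This is only a sketch: `ODS_FIELD_PATHS` and `map_ods_fields` are hypothetical names, not part of relaton-un, and the classification path for Area is the suggestion above rather than a settled decision.

```ruby
# Target locations in the Relaton model, as proposed in this thread.
# Keys are the field labels shown on the ODS result page.
ODS_FIELD_PATHS = {
  "Session or Year" => "bibdata/ext/session/number",
  "Agenda Item(s)"  => "bibdata/ext/session/item-number",
  "Distribution"    => "bibdata/ext/distribution",
  "Area"            => "bibdata/classification[@type='area']",
}.freeze

# Hypothetical helper: turn scraped label/value pairs into path/value pairs.
# Labels without a known mapping are returned separately so nothing is
# silently dropped.
def map_ods_fields(scraped)
  mapped = {}
  unmapped = {}
  scraped.each do |label, value|
    path = ODS_FIELD_PATHS[label]
    if path
      mapped[path] = value
    else
      unmapped[label] = value
    end
  end
  [mapped, unmapped]
end

mapped, unmapped = map_ods_fields({
  "Session or Year" => "10",
  "Agenda Item(s)"  => "12",
  "Distribution"    => "GEN", # raw ODS code; still needs translating to a grammar value
  "Area"            => "UNDOC",
})
mapped.each { |path, value| puts "#{path} = #{value}" }
```

Returning the unmapped labels alongside the mapped ones makes it easy to spot fields that still need a home in the model.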

andrew2net commented 4 years ago

distribution = bibdata/ext/distribution

@opoudjis I've found the distribution values GEN, LTD, and DER. In the grammar, distribution is restricted to { "general" | "limited" | "restricted" }. I suppose the mapping should be:

GEN => general
LTD => limited
DER => restricted

Is that correct?

opoudjis commented 4 years ago

http://dd.dgacm.org/editorialmanual/EM_Article_H4_487-488.pdf is the source for "general" | "limited" | "restricted"

I do not know where DER comes from; do you have an example?

andrew2net commented 4 years ago

ACC/1985/PER/R.17 has DER distribution


andrew2net commented 4 years ago

The editorialgroup element is mandatory in the ext element:

BibDataExtensionType =
    doctype?, submissionlanguage*, editorialgroup, ics*, distribution?, session?

What should we put into the editorialgroup?

opoudjis commented 4 years ago

metanorma-un presupposes that there are committees that do things in UN publications, included in the metadata as bibdata/editorialgroup/committee; for example,

United Nations Centre for Trade Facilitation and Electronic Business (UN/CEFACT)

is a committee.

andrew2net commented 4 years ago

ok, but what should we do if there isn't a committee to parse?

ronaldtse commented 4 years ago

@andrew2net

ACC/1985/PER/R.17 has DER distribution

Can you provide the link? I was able to find https://digitallibrary.un.org/record/751547?ln=en but it just says "Public".

andrew2net commented 4 years ago

@ronaldtse I searched the UN documents here by symbol (ACC/1985/PER/R.17) and got this in the search results:

[Screenshot: search result for ACC/1985/PER/R.17]

ronaldtse commented 4 years ago

ok, but what should we do if there isn't a committee to parse?

Can you provide an example where there isn't a committee? Thanks.

andrew2net commented 4 years ago

Can you provide an example where there isn't a committee? Thanks.

There isn't a committee in any document. We have only:

The metadata available are:

ronaldtse commented 4 years ago

I think the committee is readable from the Symbol.

Here, the committee is TRADE/CEFACT.

andrew2net commented 4 years ago

@ronaldtse not all symbols contain a committee abbreviation. For example, 0652A(D3)/30A(2)/P.7/ML does not. But we can't omit bibdata/editorialgroup/committee, since it is mandatory. Some symbols have the committee abbreviation at the beginning (ACC/1985/PER/R.17), some in the middle (TRADE/CEFACT/2004/32). It would be helpful to have a list of abbreviations. Is there a list of committees? Like:

ACC   => Administrative Committee on Co-ordination
ESCAP => Economic and Social Commission for Asia and the Pacific
...

UPD: it seems I found the committee list in the A/CONF.94/35 document.

UPD2: but it seems it isn't a full list.

ronaldtse commented 4 years ago

Ahhh I am not sure what to do here.

In fact, "TRADE/CEFACT" indicates one committee, so it's still the beginning of the document symbol.

Maybe this list is useful: https://research.un.org/en/docs/symbols

https://research.un.org/en/docs/ga/committees

https://en.wikipedia.org/wiki/United_Nations_Document_Codes

https://libguides.drew.edu/UNDocs/classification
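Treating "TRADE/CEFACT" as a single committee prefix, as suggested above, a committee lookup could be sketched as a longest-prefix match against a table of known abbreviations. The table below is illustrative and deliberately incomplete: it contains only names mentioned in this thread, and `COMMITTEES` and `committee_for` are hypothetical names, not part of relaton-un.

```ruby
# Illustrative, incomplete table built only from names mentioned in this thread.
COMMITTEES = {
  "ACC"          => "Administrative Committee on Co-ordination",
  "ESCAP"        => "Economic and Social Commission for Asia and the Pacific",
  "TRADE/CEFACT" => "United Nations Centre for Trade Facilitation and " \
                    "Electronic Business (UN/CEFACT)",
}.freeze

# Hypothetical helper: return [abbreviation, name] for the longest known
# abbreviation that starts the symbol on a "/" boundary, or nil when no
# abbreviation matches.
def committee_for(symbol)
  COMMITTEES
    .select { |abbr, _name| symbol == abbr || symbol.start_with?("#{abbr}/") }
    .max_by { |abbr, _name| abbr.length }
end

["TRADE/CEFACT/2004/32", "ACC/1985/PER/R.17", "0652A(D3)/30A(2)/P.7/ML"].each do |symbol|
  abbr, name = committee_for(symbol)
  puts "#{symbol}: #{abbr ? "#{abbr} (#{name})" : 'no committee found'}"
end
```

The nil case is the open question in this thread: symbols such as 0652A(D3)/30A(2)/P.7/ML match nothing, so the caller still has to decide what to put into the mandatory editorialgroup.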

andrew2net commented 4 years ago

@ronaldtse @opoudjis There is also a distribution value PRO (A/47/PV.102/CORR.1, A/47/PV.54). It looks like it means PROVISIONAL. In the grammar we have:

distribution = element distribution { "general" | "limited" | "restricted" }

I found that:

GEN => general
LTD => limited
DER => restricted

So it seems we need to extend the distribution list, don't we?

andrew2net commented 4 years ago

  • Job Number: G0430683 => ?

@ronaldtse @opoudjis Job Number still unmapped. Do we need it?

opoudjis commented 4 years ago

I will add "provisional" to the grammar, but @ronaldtse, we can't keep reconstructing the grammar of the UN documents from scraps. Someone in the UN needs to talk to us.
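With "provisional" added, the ODS distribution codes seen so far could be translated to grammar values with a small lookup. This is a sketch under stated assumptions: `DISTRIBUTIONS` and `distribution_for` are hypothetical names, the DER => restricted mapping is the proposal made earlier in this thread and has not been confirmed, and PRO => provisional assumes the grammar change lands as described.

```ruby
# ODS distribution codes observed in this thread, mapped to grammar values.
DISTRIBUTIONS = {
  "GEN" => "general",
  "LTD" => "limited",
  "DER" => "restricted",  # proposed in this thread, not confirmed
  "PRO" => "provisional", # assumes "provisional" is added to the grammar
}.freeze

# Hypothetical helper: translate a scraped code into a grammar value.
# Unknown codes raise, so that new values surface instead of being
# written out as invalid XML.
def distribution_for(code)
  DISTRIBUTIONS.fetch(code.to_s.strip.upcase) do |key|
    raise ArgumentError, "unknown distribution code: #{key.inspect}"
  end
end

puts distribution_for("GEN")
puts distribution_for("PRO")
```

Raising on unknown codes is a design choice for this sketch; returning nil and omitting the optional distribution element would be an equally reasonable alternative.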

ronaldtse commented 4 years ago

Thanks guys. @opoudjis I am uncomfortable escalating this to UN HQ until we have a good level of conformance.

ronaldtse commented 4 years ago

Job Number still unmapped. Do we need it?

Can we map it to a "job_number"? It seems to be meta information about the document.

andrew2net commented 4 years ago

@ronaldtse I didn't find a job_number element in the grammars. Do you mean to create a new element?

opoudjis commented 4 years ago

He does. Add it to ext.