Closed ronaldtse closed 4 years ago
P.S. @opoudjis this is for integration into metanorma-unece (we might want to change it to metanorma-un and support the "organization" as "unece")
Hello @ronaldtse, I ran into a problem. It looks like the site https://documents.un.org has protection against scraping. When we open the page https://documents.un.org/prod/ods.nsf/home.xsp, it sets a cookie and redirects to another page with a password in the parameters, then redirects back to the original page. I've tried to reproduce the browser's behavior, but the site redirects me to a login page. I've already spent a lot of time on this and now have to do some other tasks. I can return to this work later.
@andrew2net have you tried
?
@ronaldtse the problem isn't the scraping itself. I can get the page using a script. The site somehow detects whether the client is a browser or a script and redirects a script to the login page. I tried to make the script act like a browser and reproduce every detail of what the browser does. Maybe I missed something. I'll check it later.
@andrew2net try to use the Mechanize gem, it pretends to be a browser. The Kimura gem goes further by allowing execution of JS within the fake browser. Both ways may be able to go around this issue.
@ronaldtse ok, I'll try it
@ronaldtse the Mechanize gem doesn't solve the problem. The Kimura gem requires Ruby version >= 2.5.0. I haven't tried it yet.
@ronaldtse I solved the problem and am continuing to implement the gem.
Thank you @andrew2net!
@ronaldtse please help me map the data from the page to the Relaton model:
The mappings are good. For the remaining, should we put them as new fields in bibdata?
- Session or Year: 10 => ?
- Agenda Item(s): 12 => ?
- Distribution: GEN => ?
- Area: UNDOC => ?
Ping @opoudjis
Three of those fields are already modelled in bibdata/ext:
session = bibdata/ext/session/number
agenda item = bibdata/ext/session/item-number
distribution = bibdata/ext/distribution
Btw, I check notifications once a month, and I monitor my project, which is Metanorma. @ronaldtse and @andrew2net, I reiterate: if you want to draw my attention to any Relaton tasks, you MUST communicate with me on Skype.
The "area" is brand new to me, and I've yet to find a document whose area is not UNDOC. It does not appear on UN documents either. I'd be inclined to shove it in the bibdata/classification field, as <classification type="area">UNDOC</classification>.
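The field mapping discussed so far could be captured in a small lookup table. This is only a sketch: the constant `ODS_FIELD_PATHS` and the method `map_ods_fields` are hypothetical names, and the paths are written as plain strings taken from the comments above rather than as calls into any Relaton API.

```ruby
# Hypothetical lookup table: ODS page label => location in the Relaton model.
# The paths are the ones named in the thread; all names here are illustrative.
ODS_FIELD_PATHS = {
  "Session or Year" => "bibdata/ext/session/number",
  "Agenda Item(s)"  => "bibdata/ext/session/item-number",
  "Distribution"    => "bibdata/ext/distribution",
  "Area"            => "bibdata/classification[@type='area']"
}.freeze

# Splits raw "Label: value" lines and returns two hashes:
# { model path => value } for known labels, { label => value } for the rest.
def map_ods_fields(lines)
  mapped = {}
  unmapped = {}
  lines.each do |line|
    label, value = line.split(":", 2).map(&:strip)
    next if value.nil? || value.empty?

    path = ODS_FIELD_PATHS[label]
    if path
      mapped[path] = value
    else
      unmapped[label] = value
    end
  end
  [mapped, unmapped]
end
```

Labels without an agreed location end up in the second hash, which makes the still-open questions easy to list.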
distribution = bibdata/ext/distribution
@opoudjis I've found the distribution values GEN, LTD, and DER. In the grammar, distribution is restricted to { "general" | "limited" | "restricted" }. I suppose the mapping should be:
GEN => general
LTD => limited
DER => restricted
Is that correct?
http://dd.dgacm.org/editorialmanual/EM_Article_H4_487-488.pdf is the source for "general" | "limited" | "restricted"
I do not know where DER comes from; do you have an example?
ACC/1985/PER/R.17 has DER distribution
The editorialgroup element is mandatory for the ext element:
BibDataExtensionType =
doctype?, submissionlanguage*, editorialgroup, ics*, distribution?, session?
What should we put into the editorialgroup?
metanorma-un presupposes that there are committees that do things in UN publications, included in the metadata as bibdata/editorialgroup/committee; for example,
United Nations Centre for Trade Facilitation and Electronic Business (UN/CEFACT)
is a committee.
ok, but what should we do if there isn't a committee to parse?
@andrew2net
ACC/1985/PER/R.17 has DER distribution
Can you provide the link? I was able to find https://digitallibrary.un.org/record/751547?ln=en but it just says "Public".
@ronaldtse I searched the UN documents here by symbol (ACC/1985/PER/R.17) and got this in the search results:
ok, but what should we do if there isn't a committee to parse?
Can you provide an example where there isn't a committee? Thanks.
Can you provide an example where there isn't a committee? Thanks.
There isn't a committee in any document. We have only:
The metadata available are:
- Symbol: TRADE/CEFACT/2004/32
- Title: SECRETARIAT REVIEW OF UN/LOCODE, 19 DECEMBER 2003 / SUBMITTED BY THE SECRETARIAT
- Session or Year: 10
- Agenda Item(s): 12
- Distribution: GEN
- Area: UNDOC
- Subject(s): TRADE DATA INTERCHANGE, CODES, PORTS, TRANSPORT TERMINALS, TRADE FACILITATION, PROJECT EVALUATION, DATABASES, WEBSITES
- Publication Date: 18/03/2004
- Release Date: 30/03/2004
- Job Number: G0430683
URI:
I think the committee is readable from the Symbol.
Here, the committee is TRADE/CEFACT.
@ronaldtse not all symbols contain a committee abbreviation. For example, 0652A(D3)/30A(2)/P.7/ML does not. But we can't omit bibdata/editorialgroup/committee since it is mandatory.
Some symbols have committee abbreviations at the beginning (ACC/1985/PER/R.17), some in the middle (TRADE/CEFACT/2004/32). It would be helpful to have a list of abbreviations.
Is there a list of committees? Like:
ACC => Administrative Committee on Co-ordination
ESCAP => Economic and Social Commission for Asia and Pacific
...
UPD: it seems I found the committees list in the A/CONF.94/35 document
UPD2: but it seems it isn't the full list
Ahhh I am not sure what to do here.
In fact, "TRADE/CEFACT" indicates one committee, so it's still the beginning of the document symbol.
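One way to read a committee off the symbol, sketched under a loud assumption: the committee part is taken to be the run of leading segments that consist of capital letters only. That rule is a guess made for illustration, not a documented UN convention, and `COMMITTEES`, `committee_prefix`, and `committee_name` are hypothetical names. The table holds only the two abbreviations quoted above.

```ruby
# Abbreviations quoted in the thread; a real list would be much longer.
COMMITTEES = {
  "ACC"   => "Administrative Committee on Co-ordination",
  "ESCAP" => "Economic and Social Commission for Asia and Pacific"
}.freeze

# Heuristic (an assumption): the committee prefix is the run of leading
# symbol segments made of capital letters only. Returns nil if there is none.
def committee_prefix(symbol)
  leading = symbol.split("/").take_while { |seg| seg.match?(/\A[A-Z]+\z/) }
  leading.empty? ? nil : leading.join("/")
end

# Expands the prefix through the table, falling back to the raw prefix
# when the abbreviation is unknown. Returns nil when no prefix is found.
def committee_name(symbol)
  prefix = committee_prefix(symbol)
  prefix && COMMITTEES.fetch(prefix, prefix)
end
```

With this rule, TRADE/CEFACT/2004/32 yields TRADE/CEFACT and ACC/1985/PER/R.17 yields ACC, while 0652A(D3)/30A(2)/P.7/ML yields nothing, so the mandatory committee element would still need a fallback value for such symbols.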
Maybe this list is useful: https://research.un.org/en/docs/symbols
https://research.un.org/en/docs/ga/committees
@ronaldtse @opoudjis There is distribution PRO (A/47/PV.102/CORR.1, A/47/PV.54). Looks like it means PROVISIONAL. In the grammar we have:
distribution = element distribution { "general" | "limited" | "restricted" }
I found that:
GEN => general
LTD => limited
DER => restricted
So it seems we need to extend the distribution list, don't we?
- Job Number: G0430683 => ?
@ronaldtse @opoudjis Job Number still unmapped. Do we need it?
I will add "provisional" to the grammar, but @ronaldtse, we can't keep reconstructing the grammar of the UN documents from scraps. Someone in the UN needs to talk to us.
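The distribution codes seen so far could be translated with a plain hash. This is a sketch of the mapping as proposed in this thread, not a confirmed UN code list: DER => restricted is a guess from one example, and PRO => provisional depends on the grammar change mentioned above. `DISTRIBUTION_CODES` and `distribution_for` are hypothetical names.

```ruby
# ODS distribution code => grammar value, as proposed in the thread.
DISTRIBUTION_CODES = {
  "GEN" => "general",
  "LTD" => "limited",
  "DER" => "restricted",  # inferred from a single example, unconfirmed
  "PRO" => "provisional"  # assumes "provisional" is added to the grammar
}.freeze

# Returns the grammar value for an ODS code, or nil for an unknown code
# so that new codes are noticed instead of silently mislabelled.
def distribution_for(code)
  DISTRIBUTION_CODES[code.to_s.strip.upcase]
end
```

Returning nil for unknown codes is a deliberate choice here: since the list is being reconstructed from examples, an unmapped code is better surfaced than defaulted.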
Thanks guys. @opoudjis I am uncomfortable escalating this to UN HQ until we have a good level of conformance.
Job Number still unmapped. Do we need it?
Can we map it to a "job_number"? Seems to be meta info to the document.
@ronaldtse I didn't find a job_number element in the grammars. Do you mean we should create a new element?
He does. Add it to ext.
This is the site: https://documents.un.org/
The way to search is through the "UN Symbol", which is the unique UN document number. All UN documents, before circulation, must be registered in the UN ODS (Official Document System), and have a unique "Symbol".
For example, the document symbol "TRADE/CEFACT/2004/32" leads to this page:
The metadata available are: