Closed ronaldtse closed 3 years ago
- Create relaton-data-cie, which fetches information from the CIE techstreet.com website and also its own page for non-purchased publications.
@ronaldtse why do we need the relaton-data-cie gem? Why don't we just make relaton-cie fetch the information from the CIE site directly? What do you mean by "own page for non-purchased publications"?
@andrew2net sorry for the late reply. I suggested relaton-data-cie because it's hard to search identifiers on their site.
It would be easier to keep an index at relaton-data-cie and cache the information daily, in a form that lets us find the identifiers easily.
@ronaldtse do we need to scrape only CIE documents like CIE 243:2021, or all documents, like IEC 62477-2 Ed. 1.0 b:2018, ISO/TR 4808:2021, etc.?
There are 634 CIE documents on the site.
The number of all documents is 625,781. It may take several days to scrape all of them.
@andrew2net let's only cache the publications on cie.co.at. Only CIE documents. Thanks!
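A minimal sketch of the "only CIE documents" filter, using the three identifiers quoted above. The `only_cie` helper and the prefix rule are assumptions made for illustration, not part of any relaton gem.

```python
import re

# Assumption: a CIE publication identifier starts with the literal prefix
# "CIE" followed by whitespace; anything else (IEC, ISO, ...) is skipped.
CIE_PREFIX = re.compile(r"^CIE\s")


def only_cie(identifiers):
    """Return the identifiers that look like CIE publications, in input order."""
    return [i for i in identifiers if CIE_PREFIX.match(i.strip())]


# Identifiers taken from the question above.
listing = ["CIE 243:2021", "IEC 62477-2 Ed. 1.0 b:2018", "ISO/TR 4808:2021"]
cie_only = only_cie(listing)
```

Filtering on the identifier alone would keep the crawl close to the 634 CIE documents rather than the full catalogue, provided the prefix assumption holds for every listing.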
@ronaldtse some contributors are listed as advisors:
Which role type should we map advisor to?
role =
element role {
attribute type { ( "author" | "performer" | "publisher" | "editor" | "adapter" | "translator" | "distributor" ) }?,
roledescription*
}
Great find @andrew2net. I think we have to extend the vocabulary to accommodate that.
Ping @opoudjis to find the appropriate role. I think advisor is acceptable.
@ronaldtse it seems these documents, 016 and CIE 016-1970, are identical, aren't they? Should we process them separately? There are about 30 such pairs of documents.
This document has no identifier, and there is a similar document, CIE 197:2011, that differs only by an odd variation in the title.
@andrew2net yes the 016 documents are identical. Can we differentiate them? I noticed they have these different links:
We should prioritise the latter since it's cleaner?
The CIE 197:2011 document links are indeed strange. I wonder if they are using some non-standard way to enter these documents.
@ronaldtse it seems this document is just a disc version of some documents. Do we need it in our repo?
@ronaldtse the pages on cie.co.at are not well structured. I will try https://www.techstreet.com/cie/searches/31156444
@ronaldtse there are about 300 documents without an identifier, like this one. Should we store them in our repo? If yes, how do we reference such a document?
That particular paper seems to have a document identifier of "PO38, 616-625".
e.g. this page contains a number of those: https://www.techstreet.com/cie/subgroups/54760
Maybe it can be called "CIE PO38, 616-625"
@ronaldtse maybe, but then there are documents with the same reference. For example this and this. It wouldn't be a problem if they were identical, but several of them are missing some information. The documents I mentioned above have different authors, and one of them lacks a description. Maybe we should merge them?
@andrew2net yes this is very strange. I don't know how we can reconcile these entries. For these particular documents, simple merging is also difficult since the author names are encoded differently.
I noticed that some "conference proceeding" documents on techstreet have a "product code" that is unique:
x043 seems to be the conference ID, and the code after it is the proceeding ID. Would this ID work as the unique key?
Conference proceedings: link
But not all Conference Proceedings have this product code, e.g. this
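One way to handle the mix of records with and without a product code is a key with a fallback. The sketch below assumes scraped records are plain dicts with hypothetical `product_code` and `reference` fields. The product code value is a made-up placeholder, since the thread only gives the `x043` conference part.

```python
def record_key(record):
    """Pick a grouping key: product code if present, else the reference."""
    code = (record.get("product_code") or "").strip()
    if code:
        return ("product_code", code)
    ref = (record.get("reference") or "").strip()
    if ref:
        return ("reference", ref)
    return None  # nothing usable; the record needs manual review


def group_records(records):
    """Group records by key and collect the ones that have no key at all."""
    groups, unkeyed = {}, []
    for rec in records:
        key = record_key(rec)
        if key is None:
            unkeyed.append(rec)
        else:
            groups.setdefault(key, []).append(rec)
    return groups, unkeyed


records = [
    # Hypothetical product code; only the "x043" prefix comes from the thread.
    {"product_code": "x043-PLACEHOLDER", "reference": "CIE PO38, 616-625"},
    {"reference": "CIE PO38, 616-625", "authors": ["A"]},
    {"reference": "CIE PO38, 616-625", "authors": ["B"]},
    {"title": "No identifier at all"},
]
groups, unkeyed = group_records(records)
```

Records that land in the same `reference` group are only candidates for merging. As noted above, they can differ in authors or description, so the sketch deliberately stops at grouping.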
Ping @opoudjis to find the appropriate role. I think advisor is acceptable.
Hold up, hold up, hold up.
role =
element role {
attribute type { ( "author" | "performer" | "publisher" | "editor" | "adapter" | "translator" | "distributor" ) }?,
roledescription*
}
The type will NOT be extended beyond those seven values. That is a modelling decision inherited from ISO-690. Advisors must be entered in roledescription, as a refinement of a type; and the best fit we have to that is editor. So, <role type="editor">advisor</role>.
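The mapping described in this reply can be sketched as follows. The seven type values come from the grammar above and the attribute is named `type` as the grammar defines it. The `role_element` helper and the `REFINEMENTS` table are hypothetical. The thread does not show the element name behind `roledescription`, so the sketch puts the refinement text directly inside `role`, as the shorthand in the reply does.

```python
import xml.etree.ElementTree as ET

# The seven role types allowed by the grammar quoted in the thread.
ROLE_TYPES = {
    "author", "performer", "publisher", "editor",
    "adapter", "translator", "distributor",
}

# Assumption: only the advisor -> editor mapping is stated in the thread.
REFINEMENTS = {"advisor": "editor"}


def role_element(label):
    """Build a <role> element for a contributor label scraped from the site."""
    label = label.strip().lower()
    if label in ROLE_TYPES:
        return ET.Element("role", {"type": label})
    if label in REFINEMENTS:
        role = ET.Element("role", {"type": REFINEMENTS[label]})
        role.text = label  # the refinement is kept as a description
        return role
    raise ValueError(f"no role mapping for {label!r}")


advisor_xml = ET.tostring(role_element("advisor"), encoding="unicode")
# advisor_xml == '<role type="editor">advisor</role>'
```

Raising on unknown labels, rather than guessing a type, keeps the closed vocabulary closed: a new label has to be mapped explicitly to one of the seven values.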
@opoudjis okay
@ronaldtse there are documents like CIE S 006.1/E-1998 (ISO 16508:1999). Should we reference them like CIE S 006.1/E-1998? Shouldn't the ISO 16508:1999 be a second document id?
PS: The same goes for CIE S 009 / E:2002 / IEC 62471:2006. Should we reference it as CIE S 009 / E:2002? And shouldn't IEC 62471:2006 be a second document id?
Yes, in these cases they are dual ID documents. If the user cites CIE S 006.1/E-1998, we render CIE S 006.1/E-1998 (ISO 16508:1999).
Here's the actual document preview: https://cdn.standards.iteh.ai/samples/31003/2112810a52ec4b77917c120d7d9741b9/ISO-16508-1999.pdf
Technically this is Dual ID:
In the case of CIE S 009 / E:2002 / IEC 62471:2006, it is also a dual ID:
https://cdn.standards.iteh.ai/samples/15358/5d30e63eeba94e6ab50031932b0d23a1/IEC-62471-2006.pdf
Notice that the CIE reference number is already different between these two documents:
- -1998 vs :2002
- (ISO ...) vs / IEC ...
This is also reflected in the normative references:
For the primary ID, we should allow all these variations:
- With /E: CIE S 006.1/E-1998, CIE S 009/E:2002
- Without /E: CIE S 006.1-1998, CIE S 009:2002
For the second ID, let's use the pattern {primary id} ({secondary id}).
Note that if the primary ID contains language, the secondary ID should also contain language. Similarly if the primary ID doesn't contain language, the second ID should not contain language.
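A rough sketch of how the dual ID handling above could work: split a scraped identifier into primary and secondary parts, accept the primary ID variations listed, and render `{primary id} ({secondary id})`. The function names and regular expressions are assumptions that cover only the examples quoted in this thread. The language rule for the secondary ID is left out, because the thread does not show what a secondary ID with language looks like.

```python
import re

# Covers only the shapes quoted in the thread, e.g. "CIE S 006.1/E-1998",
# "CIE S 009 / E:2002", "CIE S 006.1-1998", "CIE S 009:2002".
PRIMARY = re.compile(
    r"^CIE\s+(?P<number>(?:S\s+)?\d+(?:\.\d+)?)"  # e.g. "S 006.1"
    r"(?:\s*/\s*(?P<lang>[A-Z]))?"                # optional language, "/E"
    r"\s*[-:]\s*(?P<year>\d{4})$"                 # "-1998" or ":2002"
)
PAREN_FORM = re.compile(r"^(?P<primary>.+?)\s*\((?P<secondary>[^()]+)\)$")
SLASH_FORM = re.compile(r"^(?P<primary>.+?)\s*/\s*(?P<secondary>(?:ISO|IEC)\b.+)$")


def split_dual(raw):
    """Split 'CIE ... (ISO ...)' or 'CIE ... / IEC ...' into two IDs."""
    raw = raw.strip()
    for pattern in (PAREN_FORM, SLASH_FORM):
        m = pattern.match(raw)
        if m:
            return m["primary"].strip(), m["secondary"].strip()
    return raw, None  # single-ID document


def parse_primary(primary):
    """Parse a primary CIE ID into number, optional language, and year."""
    m = PRIMARY.match(primary.strip())
    if not m:
        raise ValueError(f"unrecognised CIE identifier: {primary!r}")
    return {
        "number": " ".join(m["number"].split()),
        "lang": m["lang"],
        "year": m["year"],
    }


def render(primary, secondary=None):
    """Render using the agreed pattern: {primary id} ({secondary id})."""
    return f"{primary} ({secondary})" if secondary else primary


primary, secondary = split_dual("CIE S 009 / E:2002 / IEC 62471:2006")
rendered = render(primary, secondary)
```

Parsing the primary ID into fields, rather than comparing strings, is what lets `-1998` and `:2002` style separators, and IDs with or without `/E`, resolve to comparable values.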