relaton / relaton-cie

Relaton for CIE documents
BSD 2-Clause "Simplified" License

Implement Relaton CIE #1

Closed ronaldtse closed 3 years ago

ronaldtse commented 3 years ago
andrew2net commented 3 years ago
  • Create relaton-data-cie, which fetches information from the CIE techstreet.com website and also its own page for non-purchased publications.

@ronaldtse why do we need the relaton-data-cie gem? Why don't we just make relaton-cie fetch the information from CIE directly? What do you mean by "own page for non-purchased publications"?

ronaldtse commented 3 years ago

@andrew2net sorry for the late reply. I suggested relaton-data-cie because it's hard to search for identifiers on their site.

It would be easier to keep an index in relaton-data-cie and cache the information (daily) in a form that lets us easily look up the identifiers.
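A minimal sketch of such an identifier index, assuming each cached record carries an identifier and the path of the file it was cached to. The field names (`id`, `file`), the example path, and the normalization rule are illustrative assumptions, not the actual relaton-data-cie layout:

```python
import re

def normalize_id(docid: str) -> str:
    """Collapse runs of whitespace and upper-case the identifier for lookup."""
    return re.sub(r"\s+", " ", docid).strip().upper()

def build_index(records):
    """Map each normalized identifier to the file holding its cached record."""
    return {normalize_id(r["id"]): r["file"] for r in records}

def lookup(index, query):
    """Return the cached file for an identifier, or None if it is not indexed."""
    return index.get(normalize_id(query))

# Hypothetical cache entry; the path is made up for illustration.
index = build_index([
    {"id": "CIE 243:2021", "file": "data/CIE_243_2021.yaml"},
])
print(lookup(index, "cie  243:2021"))
```

Rebuilding the index once per day would match the daily cache suggested above; the lookup itself never needs to touch the publisher's site.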

andrew2net commented 3 years ago

@ronaldtse do we need to scrape only CIE documents like CIE 243:2021, or all documents, like IEC 62477-2 Ed. 1.0 b:2018, ISO/TR 4808:2021, etc.? There are 634 CIE documents on the site. The total number of documents is 625,781. It may take several days to scrape all of them.

ronaldtse commented 3 years ago

@andrew2net let's only cache the publications on cie.co.at. Only CIE documents. Thanks!

andrew2net commented 3 years ago

@ronaldtse some contributors are listed as advisors:

image

Which role type should we map advisor to?

role =
  element role {
    attribute type { ( "author" | "performer" | "publisher" | "editor" | "adapter" | "translator" | "distributor" ) }?,
    roledescription*
}

ronaldtse commented 3 years ago

Great find @andrew2net. I think we have to extend the vocabulary to accommodate that.

ronaldtse commented 3 years ago

Ping @opoudjis to find the appropriate role. I think advisor is acceptable.

andrew2net commented 3 years ago

@ronaldtse it seems these documents, 016 and CIE 016-1970, are identical, aren't they? Should we process them separately? There are about 30 such pairs of documents.

This document doesn't have an identifier, and there is a similar CIE 197:2011 document whose only difference is an odd variation in the title.

ronaldtse commented 3 years ago

@andrew2net yes, the 016 documents are identical. Can we differentiate them? I noticed they have these different links:

We should prioritise the latter since it's cleaner?

The CIE 197:2011 document links are indeed strange. I wonder if they are using some non-standard way to enter these documents.

andrew2net commented 3 years ago

@ronaldtse it seems this document is just a disc version of several documents. Do we need it in our repo?

andrew2net commented 3 years ago

@ronaldtse the pages on cie.co.at are not well structured. I'll try https://www.techstreet.com/cie/searches/31156444

andrew2net commented 3 years ago

@ronaldtse there are about 300 documents without an identifier, like this one. Should we store them in our repo? If yes, how do we reference such a document?

ronaldtse commented 3 years ago

That particular paper seems to have a document identifier of "PO38, 616-625".

Screenshot 2021-03-03 at 11 27 47 AM

e.g. this page contains a number of those: https://www.techstreet.com/cie/subgroups/54760

ronaldtse commented 3 years ago

Maybe it can be called "CIE PO38, 616-625"

andrew2net commented 3 years ago

Maybe it can be called "CIE PO38, 616-625"

@ronaldtse maybe, but then there are documents with the same reference. For example, this and this. It wouldn't be a problem if they were identical, but several of them are missing some information. The documents I mentioned above have different authors, and one of them lacks a description. Maybe we should merge them?

ronaldtse commented 3 years ago

@andrew2net yes, this is very strange. I don't know how we can reconcile these entries. For these particular documents, simple merging is also difficult since the author names are encoded differently.

I noticed that some "conference proceeding" documents on techstreet have a "product code" that is unique:

First link:

Screenshot 2021-03-05 at 10 52 08 AM

Second link:

Screenshot 2021-03-05 at 10 52 45 AM

x043 seems to be the conference ID, and the code after it is the proceeding ID. Would this ID work as the unique key?

Conference proceedings: link

But not all conference proceedings have this product code, e.g. this:

Screenshot 2021-03-05 at 10 56 25 AM
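
One way to test the product-code idea is to key each record by its product code when it has one, fall back to the "CIE" plus reference form otherwise, and then list every key shared by more than one record. This is only a sketch: the field names (`product_code`, `reference`) and the sample values are hypothetical, not taken from techstreet.

```python
from collections import defaultdict

def record_key(record):
    """Prefer the product code; fall back to 'CIE <reference>' when it is missing."""
    code = record.get("product_code")
    return code if code else f"CIE {record['reference']}"

def find_collisions(records):
    """Group records by key and keep only the keys used by more than one record."""
    groups = defaultdict(list)
    for record in records:
        groups[record_key(record)].append(record)
    return {key: group for key, group in groups.items() if len(group) > 1}

# Made-up records: two share a reference but have distinct product codes,
# two share a reference and have no product code at all.
records = [
    {"reference": "REF-A", "product_code": "code-1"},
    {"reference": "REF-A", "product_code": "code-2"},
    {"reference": "REF-B"},
    {"reference": "REF-B"},
]
collisions = find_collisions(records)
print(sorted(collisions))
```

Records that still collide after this step are the ones that would need manual reconciliation or a different key.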

opoudjis commented 3 years ago

Ping @opoudjis to find the appropriate role. I think advisor is acceptable.

Hold up, hold up, hold up.

role =
  element role {
    attribute type { ( "author" | "performer" | "publisher" | "editor" | "adapter" | "translator" | "distributor" ) }?,
    roledescription*
}

The type will NOT be extended beyond those seven values. That is a modelling decision inherited from ISO 690. Advisors must be entered in roledescription, as a refinement of a type; and the best fit we have for that is editor. So, <role type="editor">advisor</role>.
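
A minimal sketch of that mapping on the scraper side, assuming contributor labels arrive as plain strings. The `REFINEMENTS` table and the returned dictionary shape are illustrative assumptions, not the relaton-cie implementation:

```python
# The seven role types allowed by the schema quoted above.
ROLE_TYPES = {
    "author", "performer", "publisher", "editor",
    "adapter", "translator", "distributor",
}

# Hypothetical table: labels outside the seven types become a description
# that refines one of the allowed types.
REFINEMENTS = {"advisor": ("editor", "advisor")}

def map_role(label):
    """Return the role type and optional description for a contributor label."""
    key = label.strip().lower()
    if key in ROLE_TYPES:
        return {"type": key, "description": None}
    if key in REFINEMENTS:
        role_type, description = REFINEMENTS[key]
        return {"type": role_type, "description": description}
    raise ValueError(f"no mapping for contributor role: {label!r}")

print(map_role("Advisor"))
```

Raising on unknown labels keeps new role names from being silently mapped to the wrong type.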

ronaldtse commented 3 years ago

@opoudjis okay

andrew2net commented 3 years ago

@ronaldtse there are documents like CIE S 006.1/E-1998 (ISO 16508:1999). Should we reference them as CIE S 006.1/E-1998? Shouldn't ISO 16508:1999 be a second document ID?

PS: The same goes for CIE S 009 / E:2002 / IEC 62471:2006. Should we reference it as CIE S 009 / E:2002? And shouldn't IEC 62471:2006 be a second document ID?

ronaldtse commented 3 years ago

Yes, in these cases they are dual-ID documents. If the user cites CIE S 006.1/E-1998, we render CIE S 006.1/E-1998 (ISO 16508:1999).

Here's the actual document preview: https://cdn.standards.iteh.ai/samples/31003/2112810a52ec4b77917c120d7d9741b9/ISO-16508-1999.pdf

Technically this is a dual ID:

Screenshot 2021-03-22 at 9 52 02 PM

In the case of CIE S 009 / E:2002 / IEC 62471:2006, it is also a dual ID: https://cdn.standards.iteh.ai/samples/15358/5d30e63eeba94e6ab50031932b0d23a1/IEC-62471-2006.pdf

Screenshot 2021-03-22 at 9 53 42 PM

Notice that the CIE reference number is already different between these two documents:

This is also reflected in the normative references:

For the primary ID, we should allow all these variations:

For the second ID, let's use the pattern {primary id} ({secondary id}).

Note that if the primary ID contains a language, the secondary ID should also contain a language. Similarly, if the primary ID doesn't contain a language, the secondary ID should not contain one either.
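
The {primary id} ({secondary id}) pattern can be sketched as a pair of helpers, one to render the combined form and one to split it again. The function names are hypothetical, and the language rule is not modelled here because the thread does not specify how a language is marked in each identifier:

```python
import re

def render_dual_id(primary, secondary=None):
    """Render '{primary id} ({secondary id})', or just the primary ID if alone."""
    return f"{primary} ({secondary})" if secondary else primary

def split_dual_id(text):
    """Split a rendered dual ID into (primary, secondary); secondary may be None."""
    match = re.fullmatch(r"(.+?)\s*\(([^()]+)\)", text.strip())
    if match:
        return match.group(1), match.group(2)
    return text.strip(), None

rendered = render_dual_id("CIE S 006.1/E-1998", "ISO 16508:1999")
print(rendered)
print(split_dual_id(rendered))
```

With these helpers, a citation of the primary ID alone can be looked up and then rendered with the secondary ID attached.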