relaton / relaton-iso

RelatonIso: ISO Standards metadata using the BibliographicItem model
BSD 2-Clause "Simplified" License

Building a static dataset at relaton-data-iso via ISO RSS #144

Closed ronaldtse closed 6 months ago

ronaldtse commented 1 year ago

Due to the slow retrieval from ISO's site we have to build a static dataset.

The method to build the static dataset is through ISO's RSS feeds.

For each ICS code, they have an RSS feed:

An ICS RSS feed provides a list of standards like this:

<item>
  <title><![CDATA[ISO/R 860:1968 - International unification of concepts and terms]]></title>
  <link>https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/00/52/5239.html</link>
  <guid>https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/00/52/5239.html</guid>
  <description>
    <![CDATA[This document reached stage 95.99 on 1996-06-20, TC/SC: ISO/TC 37, ICS: 01.020]]>
  </description>
  <pubDate>1996-06-20</pubDate>
</item><item>
  <title><![CDATA[ISO/R 1087:1969 - Vocabulary of terminology]]></title>
  <link>https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/00/55/5590.html</link>
  <guid>https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/00/55/5590.html</guid>
  <description>
    <![CDATA[This document reached stage 95.99 on 1990-04-01, TC/SC: ISO/TC 37, ICS: 01.020; 01.040.01]]>
  </description>
  <pubDate>1990-04-01</pubDate>
</item>
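
The description text in each item carries the stage, the stage date, the committee and the ICS codes. A minimal sketch of pulling those fields out with a regular expression follows; the helper name `parse_description` and the returned hash layout are assumptions for illustration, not part of relaton-iso.

```ruby
# Hypothetical helper: parse the text of an RSS <description> element
# as shown in the items above.
DESCRIPTION_RE = %r{
  stage\s(?<stage>\d{2}\.\d{2})\s      # e.g. 95.99
  on\s(?<date>\d{4}-\d{2}-\d{2}),\s    # e.g. 1990-04-01
  TC/SC:\s(?<committee>.+?),\s         # e.g. ISO/TC 37
  ICS:\s(?<ics>.+)\z                   # e.g. 01.020; 01.040.01
}x

def parse_description(text)
  m = DESCRIPTION_RE.match(text.strip)
  return nil unless m

  {
    stage: m[:stage],
    date: m[:date],
    committee: m[:committee],
    ics: m[:ics].split(";").map(&:strip),
  }
end

desc = "This document reached stage 95.99 on 1990-04-01, TC/SC: ISO/TC 37, ICS: 01.020; 01.040.01"
info = parse_description(desc)
puts info[:stage]          # 95.99
puts info[:ics].join(", ") # 01.020, 01.040.01
```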

Steps to retrieve standards

  1. Use the top-level ICS RSS feeds to obtain a full listing of all published standards
  2. For each published standard, parse its individual page for stages and stage dates.

This way we can enumerate all published standards and the latest stages/dates.

Detecting updates

From the daily retrieval of ICS RSS feeds, we can detect if there are any changes to the documents as the RSS feeds provide the latest publication/stage dates. For items that have been updated, we can update using their individual page links.
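
The update check described above could look roughly like this. It is only a sketch: the shape of the stored records (`known`, keyed by page link) and the method name `changed_links` are assumptions, and the feed items are assumed to be already parsed into hashes.

```ruby
# known:      { link => { stage: "...", date: "..." } } saved from the previous run
# feed_items: [{ link: "...", stage: "...", date: "..." }, ...] from today's feeds
# Returns the links whose page should be fetched again.
def changed_links(known, feed_items)
  changed = feed_items.select do |item|
    previous = known[item[:link]]
    previous.nil? ||
      previous[:stage] != item[:stage] ||
      previous[:date] != item[:date]
  end
  changed.map { |item| item[:link] }
end

known = {
  "https://www.iso.org/standard/18070.html" => { stage: "90.93", date: "2021-06-04" },
}
feed_items = [
  { link: "https://www.iso.org/standard/18070.html", stage: "90.93", date: "2021-06-04" },
  { link: "https://www.iso.org/standard/77130.html", stage: "40.98", date: "2021-06-04" },
]
puts changed_links(known, feed_items)
```

Only new or changed entries are returned, so a daily run touches a small number of document pages.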

andrew2net commented 1 year ago

@ronaldtse I've tried all the codes from 00 to 99

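require "faraday"

# Probe the top-level ICS feeds (00..99) and keep the codes that answer with HTTP 200.
responds = (0..99).map do |n|
  resp = Faraday.get "https://www.iso.org/contents/data/ics/#{n.to_s.rjust(2, '0')}.rss"
  n if resp.status == 200
end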

responds.compact
=> [1, 3, 7, 11, 13, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 43, 45, 47, 49, 53, 55, 59, 61, 65, 67, 71, 73, 75, 77, 79, 81, 83, 85, 87, 91, 93, 95, 97]

responds.compact.size
=> 40

What do you think: can we use only these ICS codes to obtain the full list of published standards? Or can the RSS responses vary over time, so that we need to try all 100 codes every time?

ronaldtse commented 1 year ago

If ISO publishes in a new category, there will be new codes used. So we need to enumerate from all ICS codes.

andrew2net commented 1 year ago

@ronaldtse the pubid-iso is unable to convert some ISO identifiers to URN, so I'm going to drop URN in such cases. Is it ok?

ronaldtse commented 1 year ago

@andrew2net then that is a bug in pubid-iso — can you please help list the problematic ones? Thanks.

andrew2net commented 1 year ago

@ronaldtse I don't think this is a pubid-iso issue. The issue is that some ISO amendments have no edition in their identifiers, and without an edition, there cannot be an amendment URN. https://github.com/metanorma/pubid-iso/issues/102#issuecomment-1250556415 I'll try to add an edition scraped from the document page.

ronaldtse commented 1 year ago

Thanks for the clarification. Are those, by any chance, “approved but not yet published” documents (60.00)?

Yes, without an edition there cannot be a URN according to RFC 5141.

andrew2net commented 1 year ago

@ronaldtse I got only 6853 documents from the RSS feeds, while iso.org/search.html shows 56147 records. What do you think about it?

BTW, I found an issue that slowed down ISO document fetching significantly. A long time ago I found out that iso.org occasionally fails to render documents, so I added a check that a certain HTML element exists in the response. If it doesn't, the scraper retries the request several times. This worked well until the site template was updated one day, after which the check kept failing. I've updated the HTML element, so relaton-iso works much faster now. I'm going to raise an error in case this happens again.
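
A minimal sketch of the retry-then-raise idea follows. It is not the relaton-iso implementation: `fetch_with_retry`, `RenderError`, the `fetcher:` callable and the plain substring check (instead of a real HTML element lookup) are all assumptions made to keep the example self-contained.

```ruby
# Raised when the expected marker never shows up, e.g. after a site template change.
class RenderError < StandardError; end

# fetcher: any callable taking a URL and returning the response body as a String.
# marker:  a substring standing in for "the HTML element we expect on a rendered page".
def fetch_with_retry(url, marker:, fetcher:, attempts: 3)
  attempts.times do
    html = fetcher.call(url).to_s
    return html if html.include?(marker)
  end
  raise RenderError, "#{url}: #{marker.inspect} not found after #{attempts} attempts"
end

# Usage with a fake fetcher that fails to render once, then succeeds.
bodies = ["<html></html>", "<html><h1>ISO 10096:1997</h1></html>"]
fake_fetcher = ->(_url) { bodies.shift }
puts fetch_with_retry("https://www.iso.org/standard/18070.html", marker: "<h1>", fetcher: fake_fetcher)
```

Raising instead of silently returning makes a template change visible on the first run rather than as a slowdown.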

ronaldtse commented 1 year ago

@andrew2net according to ISO:

ISO has developed over 24638 International Standards and all are included in the ISO Standards catalogue.

  <title><![CDATA[ISO 10096:1997 - Aerospace — Nuts, hexagonal, slotted (castellated), reduced height, reduced across flats, with MJ threads, classifications: 450 MPa (at ambient temperature)/425 degrees C, 600 MPa (at ambient temperature)/235 degrees C, 600 MPa (at ambient temperature)/315 degrees C, 600 MPa (at ambient temperature)/650 degrees C, 900 MPa (at ambient temperature)/235 degrees C, 900 MPa (at ambient temperature)/730 degrees C and 1 100 MPa (at ambient temperature)/600 degrees C — Dimensions]]></title>
  <link>https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/01/80/18070.html</link>
  <guid>https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/01/80/18070.html</guid>
  <description>
    <![CDATA[This document reached stage 90.93 on 2021-06-04, TC/SC: ISO/TC 20/SC 4, ICS: 49.030.30]]>
  </description>
  <pubDate>2021-06-04</pubDate>
</item><item>
  <title><![CDATA[ISO/DIS 10096 - Aerospace — Nuts, hexagonal, slotted (castellated), reduced height, reduced across flats, with MJ threads, classifications: 450 MPa (at ambient temperature)/425 degrees C, 600 MPa (at ambient temperature)/235 degrees C, 600 MPa (at ambient temperature)/315 degrees C, 600 MPa (at ambient temperature)/650 degrees C, 900 MPa (at ambient temperature)/235 degrees C, 900 MPa (at ambient temperature)/730 degrees C and 1 100 MPa (at ambient temperature)/600 degrees C — Dimensions]]></title>
  <link>https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/07/71/77130.html</link>
  <guid>https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/07/71/77130.html</guid>
  <description>
    <![CDATA[This document reached stage 40.98 on 2021-06-04, TC/SC: ISO/TC 20/SC 4, ICS: 49.030.30]]>
  </description>
  <pubDate>2021-06-04</pubDate>
</item>

It looks like the RSS feed has a length limit.

The top-level ICS code RSS feed does not provide the full list of items:

$ curl https://www.iso.org/contents/data/ics/01.rss > 01.rss; grep title 01.rss | wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  106k    0  106k    0     0  54163      0 --:--:--  0:00:02 --:--:-- 54349
     202
$ curl https://www.iso.org/contents/data/ics/03.rss > 03.rss; grep title 03.rss | wc -l
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  109k    0  109k    0     0  27365      0 --:--:--  0:00:04 --:--:-- 27398
     202

The limit seems to be 202.
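
Note that `grep title` counts every line containing a title, not only the ones inside items. A rough sketch for counting items directly is below; the sample feed is made up, and whether the two extra titles in the real feeds come from channel-level elements is an assumption that has not been checked here.

```ruby
# Count <title> and <item> occurrences separately in raw feed text.
def feed_counts(xml)
  { titles: xml.scan("<title>").size, items: xml.scan("<item>").size }
end

sample = <<~XML
  <channel>
    <title>Feed title</title>
    <item><title>ISO 1</title></item>
    <item><title>ISO 2</title></item>
  </channel>
XML

counts = feed_counts(sample)
puts "titles=#{counts[:titles]} items=#{counts[:items]}" # titles=3 items=2
```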

I wonder if the proper way to do it is to loop through all ICS codes to obtain a list?

Example 1:

Example 2:

andrew2net commented 1 year ago

Pubid::Iso fails to parse identifiers like these:

  ISO 1942:1983/Add 1:1983/Add 6:1985
  ISO 1942:1983/Add 1:1983
  ISO 1942:1983/Add 1:1983/Add 3:1983
  ISO 1942:1983/Add 1:1983/Add 2:1983
  ISO 1942:1983/Add 1:1983/Add 5:1985
  ISO 1942:1983/Add 1:1983/Add 4:1984
  ISO 5742:1982/Add 1:1985
  ISO/TR 8373:1988/Add 1:1990
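
To illustrate the structure of these identifiers (a base document followed by a chain of addenda), here is a small sketch that splits them apart. It is only an illustration of the shape of the data, not how Pubid::Iso parses identifiers; `split_addenda` is a hypothetical name.

```ruby
# Split "BASE/Add N:YEAR/Add M:YEAR" into the base identifier and its addenda chain.
def split_addenda(id)
  base, *addenda = id.split("/Add ")
  {
    base: base,
    addenda: addenda.map do |part|
      number, year = part.split(":")
      { number: number.to_i, year: year }
    end,
  }
end

parsed = split_addenda("ISO 1942:1983/Add 1:1983/Add 6:1985")
puts parsed[:base]         # ISO 1942:1983
puts parsed[:addenda].size # 2
```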

andrew2net commented 1 year ago

@ronaldtse I've tried to fetch all ISO standards from the ICS list at https://www.iso.org/standards-catalogue/browse-by-ics.html. It took 6 hours and I got 51243 documents. There are some errors and duplicates, so the script needs to be improved. The only issue is that the ICS pages don't provide an updated date, so we need to fetch all the documents every time.

ronaldtse commented 1 year ago

Re: updated date. How about this information that comes in the description?

<![CDATA[This document reached stage 50.00 on 2023-01-30, TC/SC: ISO/TC 46/SC 10, ICS: 01.140.20]]>
<!-- ... -->
<![CDATA[This document reached stage 40.99 on 2023-01-28, TC/SC: ISO/TC 321, ICS: 01.040.03; 01.040.35; 03.080.30; 35.240.63]]>

andrew2net commented 1 year ago

@ronaldtse by parsing the ICS pages we can only get the information from the document pages that we already scrape. For example https://www.iso.org/standard/18070.html

andrew2net commented 1 year ago

@ronaldtse I've checked all ICS codes from https://www.iso.org/standards-catalogue/browse-by-ics.html. There are 750 ICS codes on the browse-by-ics pages. The RSS service returns its maximum of 200 records for 65 ICS codes. We can get up to 46417 documents from the RSS channels, but some documents aren't available due to the 200-item limit.

Number of documents per ICS code:

'01.020': 96 01.040.01: 40 01.040.03: 55 01.040.07: 30 01.040.11: 90 01.040.13: 108 01.040.17: 54 01.040.19: 18 01.040.21: 67 01.040.23: 41 01.040.25: 117 01.040.27: 51 01.040.31: 19 01.040.33: 1 01.040.35: 133 01.040.37: 64 01.040.39: 9 01.040.43: 83 01.040.45: 2 01.040.47: 25 01.040.49: 33 01.040.53: 91 01.040.55: 20 01.040.59: 103 01.040.61: 8 01.040.65: 64 01.040.67: 42 01.040.71: 43 01.040.73: 27 01.040.75: 21 01.040.77: 57 01.040.79: 68 01.040.81: 14 01.040.83: 75 01.040.85: 18 01.040.87: 28 01.040.91: 57 01.040.93: 12 01.040.97: 32 '01.060': 128 '01.070': 38 '01.075': 4 01.080.01: 19 01.080.10: 122 01.080.20: 196 01.080.30: 83 01.080.40: 4 01.080.50: 44 01.080.99: 16 01.100.01: 68 01.100.20: 124 01.100.30: 44 01.100.40: 41 01.100.99: 8 '01.110': 60 '01.120': 79 '01.140': 199 '03.020': 5 '03.060': 150 03.080.01: 11 03.080.10: 11 03.080.20: 5 03.080.30: 115 03.080.99: 41 03.100.01: 159 03.100.02: 13 03.100.10: 2 03.100.20: 6 03.100.30: 194 03.100.40: 18 03.100.70: 187 03.120.10: 113 03.120.20: 131 03.120.30: 200 '03.140': 8 '03.160': 5 '03.180': 20 03.200.01: 19 03.200.10: 8 03.200.99: 31 03.220.01: 200 03.220.20: 200 03.220.30: 4 03.220.40: 2 '03.240': 1 '07.030': 32 '07.040': 5 '07.060': 40 '07.080': 52 '07.100': 200 '07.120': 132 '07.140': 6 '11.020': 37 11.040.01: 77 11.040.10: 200 11.040.20: 168 11.040.25: 139 11.040.30: 18 11.040.40: 200 11.040.50: 3 11.040.55: 49 11.040.60: 6 11.040.70: 200 11.040.99: 29 11.060.01: 52 11.060.10: 200 11.060.15: 23 11.060.20: 200 11.060.25: 20 11.080.01: 126 11.080.10: 14 11.080.30: 15 '11.100': 176 11.120.10: 66 11.120.99: 23 '11.140': 28 '11.160': 1 '11.180': 200 '11.200': 69 13.020.01: 22 13.020.10: 72 13.020.20: 100 13.020.30: 7 13.020.40: 81 13.020.50: 15 13.020.60: 28 13.020.99: 33 13.030.01: 1 13.030.10: 1 13.030.20: 16 13.030.30: 29 13.030.40: 10 13.030.50: 11 13.040.01: 47 13.040.20: 107 13.040.30: 80 13.040.35: 28 13.040.40: 56 13.040.50: 99 13.040.99: 2 '13.060': 200 13.080.01: 42 13.080.05: 59 13.080.10: 93 
13.080.20: 64 13.080.30: 88 13.080.40: 14 13.080.99: 9 '13.100': 61 '13.110': 149 '13.120': 2 '13.140': 82 '13.160': 132 '13.180': 200 '13.200': 4 13.220.01: 113 13.220.10: 132 13.220.20: 129 13.220.40: 157 13.220.50: 170 13.220.99: 19 '13.230': 4 '13.240': 31 '13.280': 161 '13.300': 12 '13.310': 17 '13.320': 8 13.340.01: 3 13.340.10: 158 13.340.20: 70 13.340.30: 64 13.340.40: 30 13.340.50: 78 13.340.60: 12 13.340.70: 31 13.340.99: 10 '17.020': 69 17.040.01: 23 17.040.10: 45 17.040.20: 99 17.040.30: 80 17.040.40: 48 '17.060': 79 '17.080': 1 17.120.01: 1 17.120.10: 91 17.120.20: 168 17.140.01: 102 17.140.20: 200 17.140.30: 94 17.140.50: 1 '17.160': 200 17.180.01: 20 17.180.20: 32 17.180.30: 35 17.200.20: 17 '17.240': 200 '19.020': 13 '19.040': 1 '19.060': 3 '19.080': 1 '19.100': 161 '19.120': 106 '21.020': 13 21.040.01: 7 21.040.10: 41 21.040.20: 8 21.040.30: 16 '21.060': 200 21.100.01: 2 21.100.10: 197 21.100.20: 200 21.120.10: 4 21.120.20: 1 21.120.30: 20 21.120.40: 45 '21.140': 9 '21.160': 15 '21.200': 141 21.220.01: 1 21.220.10: 86 21.220.30: 32 '21.240': 2 23.020.30: 7 23.020.35: 200 23.020.40: 49 '23.040': 200 23.060.01: 68 23.060.10: 3 23.060.20: 7 23.060.30: 6 23.060.40: 9 23.060.50: 3 23.060.99: 7 '23.080': 35 23.100.01: 66 23.100.10: 36 23.100.20: 103 23.100.40: 140 23.100.50: 74 23.100.60: 133 23.100.99: 25 '23.120': 93 '23.140': 39 '23.160': 57 '25.030': 65 25.040.01: 70 25.040.10: 21 25.040.20: 71 25.040.30: 57 25.040.40: 200 25.040.99: 1 25.060.01: 2 25.060.10: 21 25.060.20: 127 25.060.99: 13 25.080.01: 65 25.080.10: 12 25.080.20: 36 25.080.30: 9 25.080.40: 11 25.080.50: 25 25.080.99: 2 '25.100': 200 25.120.10: 117 25.120.20: 6 25.120.30: 49 25.120.40: 4 25.140.01: 50 25.140.10: 77 25.140.30: 125 25.160.01: 104 25.160.10: 170 25.160.20: 143 25.160.30: 146 25.160.40: 198 25.160.50: 62 25.180.01: 17 '25.200': 11 25.220.01: 5 25.220.10: 127 25.220.20: 177 25.220.40: 178 25.220.50: 90 25.220.60: 2 25.220.99: 4 '27.010': 11 '27.015': 25 '27.020': 198 
'27.040': 48 27.060.10: 14 27.060.20: 31 27.060.30: 23 '27.075': 4 '27.080': 56 '27.100': 6 27.120.01: 12 27.120.10: 23 27.120.20: 32 27.120.30: 119 27.120.99: 1 '27.140': 5 '27.160': 38 '27.180': 6 '27.190': 76 '27.200': 29 '27.220': 22 '29.020': 2 '29.030': 1 29.035.50: 1 29.040.10: 2 29.130.20: 1 29.140.20: 2 29.160.01: 6 29.160.20: 3 29.160.30: 2 29.160.40: 43 29.160.99: 5 '29.180': 10 '29.220': 8 29.240.10: 1 29.260.20: 12 '29.280': 1 '31.020': 2 31.080.01: 6 '31.120': 1 '31.200': 1 '31.260': 103 '33.020': 3 33.040.35: 139 33.040.40: 1 33.050.30: 2 33.100.01: 19 33.100.20: 56 33.120.30: 2 33.160.50: 1 '35.020': 200 '35.030': 200 35.040.01: 1 35.040.10: 135 35.040.30: 196 35.040.40: 200 35.040.50: 200 35.040.99: 15 '35.060': 200 '35.080': 200 '35.100': 200 '35.110': 200 '35.140': 154 '35.160': 36 '35.180': 150 '35.200': 200 '35.210': 38 35.220.01: 3 35.220.10: 8 35.220.20: 22 35.220.21: 48 35.220.22: 31 35.220.23: 44 35.220.30: 122 35.240.01: 12 35.240.10: 19 35.240.15: 200 35.240.20: 200 35.240.30: 200 35.240.40: 110 35.240.50: 76 35.240.60: 200 35.240.63: 77 35.240.67: 70 35.240.68: 1 35.240.70: 169 35.240.80: 200 35.240.90: 79 35.240.99: 154 '35.260': 33 '37.020': 200 37.040.01: 18 37.040.10: 122 37.040.20: 198 37.040.25: 20 37.040.30: 85 37.040.99: 51 '37.060': 200 '37.080': 145 37.100.01: 102 37.100.10: 117 37.100.20: 14 37.100.99: 96 39.040.01: 19 39.040.10: 55 39.040.20: 4 39.040.99: 2 '39.060': 49 '43.020': 200 43.040.01: 9 43.040.10: 200 43.040.15: 200 43.040.20: 38 43.040.30: 34 43.040.40: 135 43.040.50: 63 43.040.60: 31 43.040.65: 33 43.040.70: 59 43.040.80: 77 43.040.99: 4 '43.060': 200 '43.080': 46 '43.100': 68 '43.120': 77 '43.140': 158 '43.150': 75 '43.160': 14 '43.180': 159 '45.020': 24 '45.060': 30 '45.080': 28 47.020.01: 120 47.020.05: 14 47.020.10: 65 47.020.20: 22 47.020.30: 50 47.020.40: 28 47.020.50: 112 47.020.60: 8 47.020.70: 95 47.020.80: 11 47.020.85: 1 47.020.90: 22 47.020.99: 122 '47.040': 38 '47.060': 67 '47.080': 200 '49.020': 143 
49.025.01: 14 49.025.20: 1 49.030.01: 4 49.030.10: 21 49.030.20: 59 49.030.30: 69 49.030.50: 5 49.030.60: 19 49.030.99: 10 '49.035': 63 '49.040': 10 '49.045': 2 '49.050': 4 '49.060': 198 '49.080': 168 '49.090': 10 '49.095': 1 '49.100': 128 '49.120': 43 '49.140': 200 53.020.01: 6 53.020.20: 188 53.020.30: 60 53.020.99: 16 '53.040': 200 '53.060': 177 '53.080': 3 '53.100': 200 '55.020': 94 '55.040': 2 '55.080': 21 '55.100': 75 '55.120': 35 '55.130': 4 '55.140': 9 55.180.01: 9 55.180.10: 121 55.180.20: 42 55.180.30: 35 55.180.40: 42 55.180.99: 4 '59.020': 5 '59.060': 149 59.080.01: 200 59.080.20: 28 59.080.30: 153 59.080.40: 96 59.080.50: 60 59.080.70: 95 59.100.01: 17 59.100.10: 81 59.100.20: 17 '59.120': 200 59.140.01: 1 59.140.20: 20 59.140.30: 200 59.140.99: 2 '61.020': 45 '61.040': 2 '61.060': 150 '61.080': 9 65.020.01: 6 65.020.20: 16 65.020.30: 16 65.040.10: 15 65.040.20: 3 65.040.99: 18 '65.060': 200 '65.080': 82 65.100.01: 32 '65.120': 80 '65.140': 3 '65.145': 2 '65.150': 22 '65.160': 197 '67.020': 30 '67.040': 10 '67.050': 71 '67.060': 154 67.080.01: 66 67.080.10: 57 67.080.20: 39 67.100.01: 96 67.100.10: 145 67.100.20: 21 67.100.30: 55 67.100.40: 7 67.100.99: 34 67.120.10: 40 67.120.20: 1 67.120.30: 10 67.140.10: 53 67.140.20: 46 67.140.30: 11 67.160.20: 5 67.180.10: 1 67.180.20: 32 '67.190': 3 67.200.10: 194 67.200.20: 74 67.220.10: 126 67.220.20: 17 '67.240': 85 '67.250': 31 '67.260': 29 '71.020': 8 71.040.10: 6 71.040.20: 60 71.040.30: 42 71.040.40: 200 71.040.50: 28 71.040.99: 21 '71.060': 200 71.080.10: 27 71.080.15: 15 71.080.20: 27 71.080.30: 19 71.080.40: 43 71.080.50: 19 71.080.60: 62 71.080.70: 15 71.080.80: 32 71.080.90: 16 71.100.01: 2 71.100.10: 132 71.100.20: 74 71.100.30: 27 71.100.40: 122 71.100.45: 8 71.100.50: 1 71.100.60: 200 71.100.70: 34 71.120.30: 1 71.120.99: 14 '73.020': 32 '73.040': 191 '73.060': 200 '73.080': 53 73.100.30: 36 73.100.40: 18 73.100.99: 1 '73.120': 32 '75.020': 59 '75.040': 8 '75.060': 118 '75.080': 192 '75.100': 120 
'75.120': 49 '75.140': 7 75.160.10: 130 75.160.20: 99 75.160.30: 27 75.160.40: 85 75.180.01: 32 75.180.10: 200 75.180.20: 67 75.180.30: 99 '75.200': 200 '77.020': 3 77.040.10: 200 77.040.20: 77 77.040.30: 2 77.040.99: 52 '77.080': 157 '77.100': 59 77.120.01: 5 77.120.10: 38 77.120.20: 42 77.120.30: 40 77.120.40: 51 77.120.50: 12 77.120.60: 34 '77.140': 200 77.150.01: 11 77.150.10: 90 77.150.20: 11 77.150.30: 52 77.150.40: 10 77.150.50: 3 77.150.60: 2 '77.160': 170 '77.180': 9 '79.020': 5 '79.040': 87 79.060.01: 35 79.060.10: 32 79.060.20: 29 79.060.99: 8 '79.080': 28 '79.100': 124 79.120.10: 72 79.120.20: 9 81.040.01: 36 81.040.10: 14 81.040.20: 92 81.040.30: 37 81.060.01: 5 81.060.10: 2 81.060.20: 9 81.060.30: 193 '81.080': 118 '83.020': 2 83.040.01: 6 83.040.10: 200 83.040.20: 138 83.040.30: 10 '83.060': 200 '83.080': 200 '83.100': 136 '83.120': 116 '83.140': 200 83.160.01: 66 83.160.10: 91 83.160.20: 11 83.160.30: 51 83.160.99: 30 '83.180': 151 '83.200': 14 '85.020': 7 '85.040': 200 '85.060': 200 '85.080': 57 '85.100': 3 '87.020': 24 '87.040': 200 87.060.01: 4 87.060.10: 160 87.060.20: 78 87.060.30: 5 87.060.99: 6 '87.080': 43 '87.100': 4 91.010.01: 35 91.010.20: 20 91.010.30: 19 91.040.01: 131 91.040.10: 16 91.060.01: 12 91.060.10: 13 91.060.20: 5 91.060.30: 17 91.060.40: 1 91.060.50: 65 91.060.99: 5 91.080.01: 59 91.080.10: 3 91.080.13: 8 91.080.20: 65 91.080.30: 3 91.080.40: 90 '91.090': 2 91.100.01: 18 91.100.10: 48 91.100.23: 66 91.100.30: 90 91.100.40: 55 91.100.50: 83 91.100.60: 110 91.120.10: 138 91.120.20: 125 91.120.25: 33 91.140.01: 14 91.140.10: 1 91.140.30: 62 91.140.40: 14 91.140.60: 200 91.140.70: 2 91.140.80: 102 91.140.90: 97 91.160.01: 3 91.160.10: 12 '91.200': 28 '91.220': 40 '93.010': 17 '93.020': 102 '93.025': 66 '93.030': 60 '93.040': 12 '93.060': 4 93.080.10: 28 93.080.20: 13 93.080.30: 6 '93.100': 1 '95.020': 10 '97.020': 12 97.040.01: 5 97.040.10: 1 97.040.20: 9 97.040.30: 17 97.040.40: 5 97.040.60: 32 '97.060': 32 97.100.01: 5 
97.100.30: 1 97.100.99: 8 '97.120': 1 97.130.20: 23 '97.140': 76 '97.150': 162 '97.160': 6 '97.170': 30 '97.180': 35 '97.190': 13 '97.195': 1 97.200.10: 3 97.200.30: 26 97.200.40: 10 97.200.50: 44 97.200.99: 3 97.220.01: 18 97.220.10: 16 97.220.20: 146 97.220.30: 31 97.220.40: 24 97.220.99: 2
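
Given counts like the ones listed above, the feeds that hit the cap (and therefore may be incomplete) can be picked out as in this sketch. Only a handful of entries from the list are used; `RSS_LIMIT` and `capped_codes` are hypothetical names.

```ruby
RSS_LIMIT = 200 # maximum number of records a single feed returned

# A few entries copied from the list above.
counts = {
  "01.020" => 96,
  "01.140" => 199,
  "03.120.30" => 200,
  "03.220.01" => 200,
  "03.220.20" => 200,
}

# ICS codes whose feed is at the limit and may be truncated.
def capped_codes(counts, limit = RSS_LIMIT)
  counts.select { |_code, n| n >= limit }.keys
end

puts capped_codes(counts).inspect # ["03.120.30", "03.220.01", "03.220.20"]
```

Codes in this list are the ones for which the ICS pages (or deeper scraping) remain necessary.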

ronaldtse commented 1 year ago

Thank you @andrew2net for the experimentation. It is unfortunate that we cannot obtain all the documents via RSS, which is much faster and presents less load on iso.org.

What is the alternative to obtain the index of all documents? Does Algolia provide some options here?

andrew2net commented 1 year ago

@ronaldtse I don't see any other way but to scrape all the pages under https://www.iso.org/standards-catalogue/browse-by-ics.html. It may take too much time to parse all the documents. However, I think we can get all the documents from the ICS pages once and then use RSS to update our dataset. What do you think?

ronaldtse commented 1 year ago

@andrew2net I agree, let's:

  1. Scrape all pages of https://www.iso.org/standards-catalogue/browse-by-ics.html
  2. Scrape all documents via RSS links.

Thanks!

andrew2net commented 8 months ago

@ronaldtse there is a document with ISO/CIE TR 21783:2022 | ISO/CIE TR 21783 ID. Guess we should use just ISO/CIE TR 21783:2022, right?

ronaldtse commented 8 months ago

ISO/CIE TR 21783:2022 | ISO/CIE TR 21783 ID. Guess we should use just ISO/CIE TR 21783:2022, right?

I believe so. However this is an ISO problem, we should report it.

I've reported it to Luigi Principi, the ISO webmaster.

andrew2net commented 8 months ago

I believe so. However this is an ISO problem, we should report it.

I didn't mean we should report the problem. It concerns what we should store in our dataset.

ronaldtse commented 8 months ago

I believe so. However this is an ISO problem, we should report it.

I didn't mean we should report the problem. It concerns what we should store in our dataset.

I agree with your proposed solution, but I’m hesitant to make a single hard coded exception… any other issues you’ve found?
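
One way to avoid a hard-coded exception would be a general rule for IDs that contain several `|`-separated variants: prefer the variant that carries a year. This is only a sketch of such a rule under that assumption; `pick_identifier` is a hypothetical name and the rule has not been checked against the whole dataset.

```ruby
# From "A:YEAR | A", pick the dated variant; fall back to the first variant.
def pick_identifier(raw)
  candidates = raw.split("|").map(&:strip).reject(&:empty?)
  candidates.find { |id| id.match?(/:\d{4}\b/) } || candidates.first
end

puts pick_identifier("ISO/CIE TR 21783:2022 | ISO/CIE TR 21783") # ISO/CIE TR 21783:2022
```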

andrew2net commented 8 months ago

I agree with your proposed solution, but I’m hesitant to make a single hard coded exception… any other issues you’ve found?

@ronaldtse I've tried to fetch only 10k docs of 60k. We'll see all issues in GHA log. For now:

andrew2net commented 8 months ago

I agree with your proposed solution, but I’m hesitant to make a single hard coded exception… any other issues you’ve found?

@ronaldtse I have only parsed 10k of 60k docs. For now I found:

andrew2net commented 8 months ago

@ronaldtse I haven't parsed all the ISO documents yet, only 10k of about 60k. For now, the issues are:

ronaldtse commented 8 months ago

Default document type: yes.

DATA document types. Yes we can fix them later.

Duplicated IDs. This instance is when the project was cancelled and then re-started, which was then withdrawn. I think for “status: deleted” items we just ignore them for now.
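
Ignoring deleted items and then checking for remaining duplicates could be sketched as follows. The record shape (`:id`, `:status`) and the sample identifiers are made up for illustration and do not reflect the actual dataset format.

```ruby
# Drop records whose status is "deleted".
def active_items(items)
  items.reject { |item| item[:status] == "deleted" }
end

# IDs that still occur more than once.
def duplicated_ids(items)
  items.group_by { |item| item[:id] }.select { |_id, group| group.size > 1 }.keys
end

# Made-up sample records.
items = [
  { id: "ISO/NP 99999", status: "deleted" },
  { id: "ISO/NP 99999", status: "withdrawn" },
  { id: "ISO 99998:2020", status: "published" },
]

puts duplicated_ids(items).inspect               # ["ISO/NP 99999"]
puts duplicated_ids(active_items(items)).inspect # []
```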

andrew2net commented 8 months ago

@ronaldtse in the static dataset there are docs with an ISO/IEC DIR relation. Pubid::Iso fails to parse the ISO/IEC DIR ID, and there isn't an ISO/IEC DIR doc in the dataset. Is the relation correct? If the relation is correct, then we need to fix Pubid::Iso.

ronaldtse commented 8 months ago

Pubid::Iso fails to parse the ISO/IEC DIR ID, and there isn't an ISO/IEC DIR doc in the dataset. Is the relation correct? If the relation is correct, then we need to fix Pubid::Iso.

In this case let's remove the ISO/IEC DIR relation, because ISO/IEC DIR is analogous to ISO/IEC TR which doesn't quite make sense as a citation target.

andrew2net commented 7 months ago

@ronaldtse At least one document, https://www.iso.org/standard/77374.html, isn't listed on the ICS pages. The doc has deleted status, so I think we can ignore it, right?

ronaldtse commented 7 months ago

@andrew2net I think we can ignore it, but it does contain ICS:

[Screenshot 2024-01-30 at 3 43 33 PM]

andrew2net commented 7 months ago

@ronaldtse yes, it contains ICS. I've opened the 43.020 and 17.140.30 ICS pages and didn't find the doc there, so we are unable to get the doc from the ICS pages.

ronaldtse commented 7 months ago

@andrew2net then probably because it was out of range. There is a page limit for ICS pages, right?

In any case, we probably should keep a list of all the project numbers (the 5-digit IDs) because ISO does not re-use them.
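
The project number can be read off both link forms that appear in this thread. A minimal sketch, assuming the number is always the last path segment before `.html`; `project_number` is a hypothetical helper name.

```ruby
# Extract the project number from either
#   https://www.iso.org/standard/77374.html
#   https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/00/52/5239.html
def project_number(url)
  m = url.match(%r{/standard/(?:\d{2}/\d{2}/)?(\d+)\.html\z})
  m && m[1].to_i
end

puts project_number("https://www.iso.org/standard/77374.html") # 77374
puts project_number("https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/00/52/5239.html") # 5239
```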

andrew2net commented 7 months ago

@ronaldtse we can extract all the project numbers that we have in the relaton-data-iso repo from the index.

andrew2net commented 7 months ago

@andrew2net then probably because it was out of range. There is a page limit for ICS pages, right?

@ronaldtse I don't think there is a page limit. The page 43.020 has 220 docs, but 17.140.30 has only 92.

ronaldtse commented 6 months ago

Then is the problem that this project was "deleted", and that is why it's not shown in the ICS pages?

andrew2net commented 6 months ago

implemented in v1.18.2