Closed ronaldtse closed 6 months ago
@ronaldtse I've tried all the codes from 00 to 99
require "faraday"

responds = (0..99).map do |n|
  resp = Faraday.get "https://www.iso.org/contents/data/ics/#{n.to_s.rjust(2, '0')}.rss"
  n if resp.status == 200
end
responds.compact
=> [1, 3, 7, 11, 13, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 43, 45, 47, 49, 53, 55, 59, 61, 65, 67, 71, 73, 75, 77, 79, 81, 83, 85, 87, 91, 93, 95, 97]
responds.compact.size
=> 40
Do you think we can use only these ICS codes to obtain the full list of published standards? Or can the RSS responses vary over time, so that we need to try each of the 100 codes every time?
If ISO publishes in a new category, there will be new codes used. So we need to enumerate from all ICS codes.
@ronaldtse pubid-iso is unable to convert some ISO identifiers to URNs, so I'm going to drop the URN in such cases. Is that OK?
@andrew2net then that is a bug in pubid-iso — can you please help list the problematic ones? Thanks.
@ronaldtse I don't think this is a pubid-iso issue. The issue is that some ISO amendments have no edition in their identifiers, and without an edition, there cannot be an amendment URN. https://github.com/metanorma/pubid-iso/issues/102#issuecomment-1250556415 I'll try to add an edition scraped from the document page.
Thanks for the clarification. Are those by any chance “approved but not yet published” documents (60.00)?
Yes, without an edition there cannot be a URN according to RFC 5141.
@ronaldtse I got only 6853 documents with the RSS. The iso.org/search.html page shows that it has 56147 records. What do you think about it?
BTW I found an issue that slows down fetching of ISO documents significantly. A long time ago I found out that iso.org fails to render documents from time to time, so I added a check for whether a certain HTML element exists in the response. If it doesn't, the scraper tries to get the page again several times. It worked well until the site template was updated one day. I've updated the HTML element, so relaton-iso works much faster now. I'm going to raise an error in case this happens again.
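A minimal sketch of the check described above, assuming a hypothetical `fetch` callable that returns the page HTML and a hypothetical marker string standing in for the expected HTML element; neither is taken from relaton-iso. It retries a few times and raises an error when the marker never appears, rather than retrying silently:

```ruby
# Raised when the expected marker never shows up, e.g. after a template change.
class PageTemplateError < StandardError; end

# fetch:    any callable returning the page HTML as a String (hypothetical).
# marker:   a substring expected in every correctly rendered page (hypothetical).
# attempts: how many times to try before giving up.
def fetch_rendered_page(fetch, marker:, attempts: 3)
  attempts.times do
    html = fetch.call
    return html if html.include?(marker)
  end
  raise PageTemplateError, "marker #{marker.inspect} not found after #{attempts} attempts"
end
```

With this shape, a template change surfaces as a `PageTemplateError` on the first affected document instead of as a slow run.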
@andrew2net according to ISO:
ISO has developed over 24638 International Standards and all are included in the ISO Standards catalogue.
<title><![CDATA[ISO 10096:1997 - Aerospace — Nuts, hexagonal, slotted (castellated), reduced height, reduced across flats, with MJ threads, classifications: 450 MPa (at ambient temperature)/425 degrees C, 600 MPa (at ambient temperature)/235 degrees C, 600 MPa (at ambient temperature)/315 degrees C, 600 MPa (at ambient temperature)/650 degrees C, 900 MPa (at ambient temperature)/235 degrees C, 900 MPa (at ambient temperature)/730 degrees C and 1 100 MPa (at ambient temperature)/600 degrees C — Dimensions]]></title>
<link>https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/01/80/18070.html</link>
<guid>https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/01/80/18070.html</guid>
<description>
<![CDATA[This document reached stage 90.93 on 2021-06-04, TC/SC: ISO/TC 20/SC 4, ICS: 49.030.30]]>
</description>
<pubDate>2021-06-04</pubDate>
</item><item>
<title><![CDATA[ISO/DIS 10096 - Aerospace — Nuts, hexagonal, slotted (castellated), reduced height, reduced across flats, with MJ threads, classifications: 450 MPa (at ambient temperature)/425 degrees C, 600 MPa (at ambient temperature)/235 degrees C, 600 MPa (at ambient temperature)/315 degrees C, 600 MPa (at ambient temperature)/650 degrees C, 900 MPa (at ambient temperature)/235 degrees C, 900 MPa (at ambient temperature)/730 degrees C and 1 100 MPa (at ambient temperature)/600 degrees C — Dimensions]]></title>
<link>https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/07/71/77130.html</link>
<guid>https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/07/71/77130.html</guid>
<description>
<![CDATA[This document reached stage 40.98 on 2021-06-04, TC/SC: ISO/TC 20/SC 4, ICS: 49.030.30]]>
</description>
<pubDate>2021-06-04</pubDate>
</item><item>
It looks like the RSS feed has a length limit.
The top-level ICS code RSS feed does not provide the full list of items:
$ curl https://www.iso.org/contents/data/ics/01.rss > 01.rss; grep title 01.rss | wc -l
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 106k 0 106k 0 0 54163 0 --:--:-- 0:00:02 --:--:-- 54349
202
$ curl https://www.iso.org/contents/data/ics/03.rss > 03.rss; grep title 03.rss | wc -l
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 109k 0 109k 0 0 27365 0 --:--:-- 0:00:04 --:--:-- 27398
202
The limit seems to be 202.
I wonder if the proper way to do it is to loop through all ICS codes to obtain a list?
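As a rough sketch of reading one feed, the following parses RSS text with REXML and counts the items it holds. The `parse_items` helper name is illustrative, the sample feed is trimmed and its channel title is made up, and fetching the text over HTTP is left out:

```ruby
require "rexml/document"

# Returns one Hash per <item> element found in the RSS text.
def parse_items(rss_text)
  doc = REXML::Document.new(rss_text)
  REXML::XPath.match(doc, "//item").map do |item|
    {
      title: item.elements["title"].text,
      link: item.elements["link"].text,
      pub_date: item.elements["pubDate"].text,
    }
  end
end

# Trimmed sample in the shape of the feed shown earlier in this thread.
sample = <<~XML
  <rss version="2.0"><channel><title>Sample channel</title>
  <item>
    <title><![CDATA[ISO 10096:1997 - Aerospace]]></title>
    <link>https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/01/80/18070.html</link>
    <pubDate>2021-06-04</pubDate>
  </item>
  </channel></rss>
XML

items = parse_items(sample)
puts items.size
```

Counting `<item>` elements rather than grepping for `title` avoids including the channel's own title in the total.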
Example 1:
Example 2:
Pubid::Iso fails to parse identifiers like these:
ISO 1942:1983/Add 1:1983/Add 6:1985
ISO 1942:1983/Add 1:1983
ISO 1942:1983/Add 1:1983/Add 3:1983
ISO 1942:1983/Add 1:1983/Add 2:1983
ISO 1942:1983/Add 1:1983/Add 5:1985
ISO 1942:1983/Add 1:1983/Add 4:1984
ISO 5742:1982/Add 1:1985
ISO/TR 8373:1988/Add 1:1990
@ronaldtse I've tried to fetch all ISO standards from the https://www.iso.org/standards-catalogue/browse-by-ics.html ICS list. It took 6 hours and I got 51243 documents. There are some errors and duplicates, so the script needs to be improved. The only issue is that the ICS pages don't provide an updated date, so we need to fetch all the documents every time.
Re: updated date. How about this information that comes in the description?
<![CDATA[This document reached stage 50.00 on 2023-01-30, TC/SC: ISO/TC 46/SC 10, ICS: 01.140.20]]>
<!-- ... -->
<![CDATA[This document reached stage 40.99 on 2023-01-28, TC/SC: ISO/TC 321, ICS: 01.040.03; 01.040.35; 03.080.30; 35.240.63]]>
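The description text is regular enough to pull apart with a regular expression. The sketch below assumes every description follows the wording of the two samples above; the helper name and the returned keys are illustrative only:

```ruby
require "date"

DESCRIPTION_RE = %r{reached stage (?<stage>\d{2}\.\d{2}) on (?<date>\d{4}-\d{2}-\d{2}), TC/SC: (?<committee>.+?), ICS: (?<ics>.+)\z}

# Returns a Hash with the stage, date, committee and ICS codes, or nil when
# the text does not match the expected wording.
def parse_description(text)
  m = DESCRIPTION_RE.match(text.strip)
  return nil unless m

  {
    stage: m[:stage],
    date: Date.iso8601(m[:date]),
    committee: m[:committee],
    ics: m[:ics].split(";").map(&:strip),
  }
end

info = parse_description(
  "This document reached stage 40.99 on 2023-01-28, TC/SC: ISO/TC 321, " \
  "ICS: 01.040.03; 01.040.35; 03.080.30; 35.240.63"
)
puts info[:stage]
puts info[:ics].inspect
```

Returning `nil` on a mismatch keeps a future wording change visible instead of producing half-filled records.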
@ronaldtse by parsing the ICS pages we can only get the information from the document pages that we already scrape now. For example https://www.iso.org/standard/18070.html
@ronaldtse I've checked all ICS codes from https://www.iso.org/standards-catalogue/browse-by-ics.html. There are 750 ICS codes on the browse-by-ics pages. The RSS service returns its maximum of 200 records for 65 ICS codes. We can get up to 46417 documents through the RSS channels, but some documents aren't available due to the 200-item limit.
'01.020': 96 01.040.01: 40 01.040.03: 55 01.040.07: 30 01.040.11: 90 01.040.13: 108 01.040.17: 54 01.040.19: 18 01.040.21: 67 01.040.23: 41 01.040.25: 117 01.040.27: 51 01.040.31: 19 01.040.33: 1 01.040.35: 133 01.040.37: 64 01.040.39: 9 01.040.43: 83 01.040.45: 2 01.040.47: 25 01.040.49: 33 01.040.53: 91 01.040.55: 20 01.040.59: 103 01.040.61: 8 01.040.65: 64 01.040.67: 42 01.040.71: 43 01.040.73: 27 01.040.75: 21 01.040.77: 57 01.040.79: 68 01.040.81: 14 01.040.83: 75 01.040.85: 18 01.040.87: 28 01.040.91: 57 01.040.93: 12 01.040.97: 32 '01.060': 128 '01.070': 38 '01.075': 4 01.080.01: 19 01.080.10: 122 01.080.20: 196 01.080.30: 83 01.080.40: 4 01.080.50: 44 01.080.99: 16 01.100.01: 68 01.100.20: 124 01.100.30: 44 01.100.40: 41 01.100.99: 8 '01.110': 60 '01.120': 79 '01.140': 199 '03.020': 5 '03.060': 150 03.080.01: 11 03.080.10: 11 03.080.20: 5 03.080.30: 115 03.080.99: 41 03.100.01: 159 03.100.02: 13 03.100.10: 2 03.100.20: 6 03.100.30: 194 03.100.40: 18 03.100.70: 187 03.120.10: 113 03.120.20: 131 03.120.30: 200 '03.140': 8 '03.160': 5 '03.180': 20 03.200.01: 19 03.200.10: 8 03.200.99: 31 03.220.01: 200 03.220.20: 200 03.220.30: 4 03.220.40: 2 '03.240': 1 '07.030': 32 '07.040': 5 '07.060': 40 '07.080': 52 '07.100': 200 '07.120': 132 '07.140': 6 '11.020': 37 11.040.01: 77 11.040.10: 200 11.040.20: 168 11.040.25: 139 11.040.30: 18 11.040.40: 200 11.040.50: 3 11.040.55: 49 11.040.60: 6 11.040.70: 200 11.040.99: 29 11.060.01: 52 11.060.10: 200 11.060.15: 23 11.060.20: 200 11.060.25: 20 11.080.01: 126 11.080.10: 14 11.080.30: 15 '11.100': 176 11.120.10: 66 11.120.99: 23 '11.140': 28 '11.160': 1 '11.180': 200 '11.200': 69 13.020.01: 22 13.020.10: 72 13.020.20: 100 13.020.30: 7 13.020.40: 81 13.020.50: 15 13.020.60: 28 13.020.99: 33 13.030.01: 1 13.030.10: 1 13.030.20: 16 13.030.30: 29 13.030.40: 10 13.030.50: 11 13.040.01: 47 13.040.20: 107 13.040.30: 80 13.040.35: 28 13.040.40: 56 13.040.50: 99 13.040.99: 2 '13.060': 200 13.080.01: 42 13.080.05: 59 13.080.10: 93 
13.080.20: 64 13.080.30: 88 13.080.40: 14 13.080.99: 9 '13.100': 61 '13.110': 149 '13.120': 2 '13.140': 82 '13.160': 132 '13.180': 200 '13.200': 4 13.220.01: 113 13.220.10: 132 13.220.20: 129 13.220.40: 157 13.220.50: 170 13.220.99: 19 '13.230': 4 '13.240': 31 '13.280': 161 '13.300': 12 '13.310': 17 '13.320': 8 13.340.01: 3 13.340.10: 158 13.340.20: 70 13.340.30: 64 13.340.40: 30 13.340.50: 78 13.340.60: 12 13.340.70: 31 13.340.99: 10 '17.020': 69 17.040.01: 23 17.040.10: 45 17.040.20: 99 17.040.30: 80 17.040.40: 48 '17.060': 79 '17.080': 1 17.120.01: 1 17.120.10: 91 17.120.20: 168 17.140.01: 102 17.140.20: 200 17.140.30: 94 17.140.50: 1 '17.160': 200 17.180.01: 20 17.180.20: 32 17.180.30: 35 17.200.20: 17 '17.240': 200 '19.020': 13 '19.040': 1 '19.060': 3 '19.080': 1 '19.100': 161 '19.120': 106 '21.020': 13 21.040.01: 7 21.040.10: 41 21.040.20: 8 21.040.30: 16 '21.060': 200 21.100.01: 2 21.100.10: 197 21.100.20: 200 21.120.10: 4 21.120.20: 1 21.120.30: 20 21.120.40: 45 '21.140': 9 '21.160': 15 '21.200': 141 21.220.01: 1 21.220.10: 86 21.220.30: 32 '21.240': 2 23.020.30: 7 23.020.35: 200 23.020.40: 49 '23.040': 200 23.060.01: 68 23.060.10: 3 23.060.20: 7 23.060.30: 6 23.060.40: 9 23.060.50: 3 23.060.99: 7 '23.080': 35 23.100.01: 66 23.100.10: 36 23.100.20: 103 23.100.40: 140 23.100.50: 74 23.100.60: 133 23.100.99: 25 '23.120': 93 '23.140': 39 '23.160': 57 '25.030': 65 25.040.01: 70 25.040.10: 21 25.040.20: 71 25.040.30: 57 25.040.40: 200 25.040.99: 1 25.060.01: 2 25.060.10: 21 25.060.20: 127 25.060.99: 13 25.080.01: 65 25.080.10: 12 25.080.20: 36 25.080.30: 9 25.080.40: 11 25.080.50: 25 25.080.99: 2 '25.100': 200 25.120.10: 117 25.120.20: 6 25.120.30: 49 25.120.40: 4 25.140.01: 50 25.140.10: 77 25.140.30: 125 25.160.01: 104 25.160.10: 170 25.160.20: 143 25.160.30: 146 25.160.40: 198 25.160.50: 62 25.180.01: 17 '25.200': 11 25.220.01: 5 25.220.10: 127 25.220.20: 177 25.220.40: 178 25.220.50: 90 25.220.60: 2 25.220.99: 4 '27.010': 11 '27.015': 25 '27.020': 198 
'27.040': 48 27.060.10: 14 27.060.20: 31 27.060.30: 23 '27.075': 4 '27.080': 56 '27.100': 6 27.120.01: 12 27.120.10: 23 27.120.20: 32 27.120.30: 119 27.120.99: 1 '27.140': 5 '27.160': 38 '27.180': 6 '27.190': 76 '27.200': 29 '27.220': 22 '29.020': 2 '29.030': 1 29.035.50: 1 29.040.10: 2 29.130.20: 1 29.140.20: 2 29.160.01: 6 29.160.20: 3 29.160.30: 2 29.160.40: 43 29.160.99: 5 '29.180': 10 '29.220': 8 29.240.10: 1 29.260.20: 12 '29.280': 1 '31.020': 2 31.080.01: 6 '31.120': 1 '31.200': 1 '31.260': 103 '33.020': 3 33.040.35: 139 33.040.40: 1 33.050.30: 2 33.100.01: 19 33.100.20: 56 33.120.30: 2 33.160.50: 1 '35.020': 200 '35.030': 200 35.040.01: 1 35.040.10: 135 35.040.30: 196 35.040.40: 200 35.040.50: 200 35.040.99: 15 '35.060': 200 '35.080': 200 '35.100': 200 '35.110': 200 '35.140': 154 '35.160': 36 '35.180': 150 '35.200': 200 '35.210': 38 35.220.01: 3 35.220.10: 8 35.220.20: 22 35.220.21: 48 35.220.22: 31 35.220.23: 44 35.220.30: 122 35.240.01: 12 35.240.10: 19 35.240.15: 200 35.240.20: 200 35.240.30: 200 35.240.40: 110 35.240.50: 76 35.240.60: 200 35.240.63: 77 35.240.67: 70 35.240.68: 1 35.240.70: 169 35.240.80: 200 35.240.90: 79 35.240.99: 154 '35.260': 33 '37.020': 200 37.040.01: 18 37.040.10: 122 37.040.20: 198 37.040.25: 20 37.040.30: 85 37.040.99: 51 '37.060': 200 '37.080': 145 37.100.01: 102 37.100.10: 117 37.100.20: 14 37.100.99: 96 39.040.01: 19 39.040.10: 55 39.040.20: 4 39.040.99: 2 '39.060': 49 '43.020': 200 43.040.01: 9 43.040.10: 200 43.040.15: 200 43.040.20: 38 43.040.30: 34 43.040.40: 135 43.040.50: 63 43.040.60: 31 43.040.65: 33 43.040.70: 59 43.040.80: 77 43.040.99: 4 '43.060': 200 '43.080': 46 '43.100': 68 '43.120': 77 '43.140': 158 '43.150': 75 '43.160': 14 '43.180': 159 '45.020': 24 '45.060': 30 '45.080': 28 47.020.01: 120 47.020.05: 14 47.020.10: 65 47.020.20: 22 47.020.30: 50 47.020.40: 28 47.020.50: 112 47.020.60: 8 47.020.70: 95 47.020.80: 11 47.020.85: 1 47.020.90: 22 47.020.99: 122 '47.040': 38 '47.060': 67 '47.080': 200 '49.020': 143 
49.025.01: 14 49.025.20: 1 49.030.01: 4 49.030.10: 21 49.030.20: 59 49.030.30: 69 49.030.50: 5 49.030.60: 19 49.030.99: 10 '49.035': 63 '49.040': 10 '49.045': 2 '49.050': 4 '49.060': 198 '49.080': 168 '49.090': 10 '49.095': 1 '49.100': 128 '49.120': 43 '49.140': 200 53.020.01: 6 53.020.20: 188 53.020.30: 60 53.020.99: 16 '53.040': 200 '53.060': 177 '53.080': 3 '53.100': 200 '55.020': 94 '55.040': 2 '55.080': 21 '55.100': 75 '55.120': 35 '55.130': 4 '55.140': 9 55.180.01: 9 55.180.10: 121 55.180.20: 42 55.180.30: 35 55.180.40: 42 55.180.99: 4 '59.020': 5 '59.060': 149 59.080.01: 200 59.080.20: 28 59.080.30: 153 59.080.40: 96 59.080.50: 60 59.080.70: 95 59.100.01: 17 59.100.10: 81 59.100.20: 17 '59.120': 200 59.140.01: 1 59.140.20: 20 59.140.30: 200 59.140.99: 2 '61.020': 45 '61.040': 2 '61.060': 150 '61.080': 9 65.020.01: 6 65.020.20: 16 65.020.30: 16 65.040.10: 15 65.040.20: 3 65.040.99: 18 '65.060': 200 '65.080': 82 65.100.01: 32 '65.120': 80 '65.140': 3 '65.145': 2 '65.150': 22 '65.160': 197 '67.020': 30 '67.040': 10 '67.050': 71 '67.060': 154 67.080.01: 66 67.080.10: 57 67.080.20: 39 67.100.01: 96 67.100.10: 145 67.100.20: 21 67.100.30: 55 67.100.40: 7 67.100.99: 34 67.120.10: 40 67.120.20: 1 67.120.30: 10 67.140.10: 53 67.140.20: 46 67.140.30: 11 67.160.20: 5 67.180.10: 1 67.180.20: 32 '67.190': 3 67.200.10: 194 67.200.20: 74 67.220.10: 126 67.220.20: 17 '67.240': 85 '67.250': 31 '67.260': 29 '71.020': 8 71.040.10: 6 71.040.20: 60 71.040.30: 42 71.040.40: 200 71.040.50: 28 71.040.99: 21 '71.060': 200 71.080.10: 27 71.080.15: 15 71.080.20: 27 71.080.30: 19 71.080.40: 43 71.080.50: 19 71.080.60: 62 71.080.70: 15 71.080.80: 32 71.080.90: 16 71.100.01: 2 71.100.10: 132 71.100.20: 74 71.100.30: 27 71.100.40: 122 71.100.45: 8 71.100.50: 1 71.100.60: 200 71.100.70: 34 71.120.30: 1 71.120.99: 14 '73.020': 32 '73.040': 191 '73.060': 200 '73.080': 53 73.100.30: 36 73.100.40: 18 73.100.99: 1 '73.120': 32 '75.020': 59 '75.040': 8 '75.060': 118 '75.080': 192 '75.100': 120 
'75.120': 49 '75.140': 7 75.160.10: 130 75.160.20: 99 75.160.30: 27 75.160.40: 85 75.180.01: 32 75.180.10: 200 75.180.20: 67 75.180.30: 99 '75.200': 200 '77.020': 3 77.040.10: 200 77.040.20: 77 77.040.30: 2 77.040.99: 52 '77.080': 157 '77.100': 59 77.120.01: 5 77.120.10: 38 77.120.20: 42 77.120.30: 40 77.120.40: 51 77.120.50: 12 77.120.60: 34 '77.140': 200 77.150.01: 11 77.150.10: 90 77.150.20: 11 77.150.30: 52 77.150.40: 10 77.150.50: 3 77.150.60: 2 '77.160': 170 '77.180': 9 '79.020': 5 '79.040': 87 79.060.01: 35 79.060.10: 32 79.060.20: 29 79.060.99: 8 '79.080': 28 '79.100': 124 79.120.10: 72 79.120.20: 9 81.040.01: 36 81.040.10: 14 81.040.20: 92 81.040.30: 37 81.060.01: 5 81.060.10: 2 81.060.20: 9 81.060.30: 193 '81.080': 118 '83.020': 2 83.040.01: 6 83.040.10: 200 83.040.20: 138 83.040.30: 10 '83.060': 200 '83.080': 200 '83.100': 136 '83.120': 116 '83.140': 200 83.160.01: 66 83.160.10: 91 83.160.20: 11 83.160.30: 51 83.160.99: 30 '83.180': 151 '83.200': 14 '85.020': 7 '85.040': 200 '85.060': 200 '85.080': 57 '85.100': 3 '87.020': 24 '87.040': 200 87.060.01: 4 87.060.10: 160 87.060.20: 78 87.060.30: 5 87.060.99: 6 '87.080': 43 '87.100': 4 91.010.01: 35 91.010.20: 20 91.010.30: 19 91.040.01: 131 91.040.10: 16 91.060.01: 12 91.060.10: 13 91.060.20: 5 91.060.30: 17 91.060.40: 1 91.060.50: 65 91.060.99: 5 91.080.01: 59 91.080.10: 3 91.080.13: 8 91.080.20: 65 91.080.30: 3 91.080.40: 90 '91.090': 2 91.100.01: 18 91.100.10: 48 91.100.23: 66 91.100.30: 90 91.100.40: 55 91.100.50: 83 91.100.60: 110 91.120.10: 138 91.120.20: 125 91.120.25: 33 91.140.01: 14 91.140.10: 1 91.140.30: 62 91.140.40: 14 91.140.60: 200 91.140.70: 2 91.140.80: 102 91.140.90: 97 91.160.01: 3 91.160.10: 12 '91.200': 28 '91.220': 40 '93.010': 17 '93.020': 102 '93.025': 66 '93.030': 60 '93.040': 12 '93.060': 4 93.080.10: 28 93.080.20: 13 93.080.30: 6 '93.100': 1 '95.020': 10 '97.020': 12 97.040.01: 5 97.040.10: 1 97.040.20: 9 97.040.30: 17 97.040.40: 5 97.040.60: 32 '97.060': 32 97.100.01: 5 
97.100.30: 1 97.100.99: 8 '97.120': 1 97.130.20: 23 '97.140': 76 '97.150': 162 '97.160': 6 '97.170': 30 '97.180': 35 '97.190': 13 '97.195': 1 97.200.10: 3 97.200.30: 26 97.200.40: 10 97.200.50: 44 97.200.99: 3 97.220.01: 18 97.220.10: 16 97.220.20: 146 97.220.30: 31 97.220.40: 24 97.220.99: 2
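To see which feeds are affected, the per-code counts can be filtered for those at the cap. A small sketch, using a handful of the counts listed above and assuming the cap is exactly 200 items (the constant and helper names are illustrative):

```ruby
RSS_ITEM_LIMIT = 200 # cap observed in this thread; an assumption, not a documented value

# counts: Hash mapping an ICS code to the number of items its feed returned.
# Returns the codes whose feeds hit the cap and so may be missing documents.
def truncated_feeds(counts, limit: RSS_ITEM_LIMIT)
  counts.select { |_code, n| n >= limit }.keys
end

counts = {
  "01.020" => 96,
  "01.140" => 199,
  "03.120.30" => 200,
  "03.220.01" => 200,
  "07.100" => 200,
}
p truncated_feeds(counts) # => ["03.120.30", "03.220.01", "07.100"]
```

Only the codes returned here would need the slower ICS page scrape to recover the documents the feed leaves out.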
Thank you @andrew2net for the experimentation. It is unfortunate that we cannot obtain all the documents via RSS, which is much faster and presents less load on iso.org.
What is the alternative to obtain the index of all documents? Does Algolia provide some options here?
@ronaldtse I don't see any other way but to scrape all the pages under https://www.iso.org/standards-catalogue/browse-by-ics.html. It may take too much time to parse all the documents. However, I think we can get all the documents from the ICS pages once and then use RSS to update our dataset. What do you think?
@andrew2net I agree, let's:
Thanks!
@ronaldtse there is a document with the ID ISO/CIE TR 21783:2022 | ISO/CIE TR 21783. I guess we should use just ISO/CIE TR 21783:2022, right?
I believe so. However this is an ISO problem, we should report it.
I've reported to Luigi Principi the ISO webmaster.
I believe so. However this is an ISO problem, we should report it.
I didn't mean we should report the problem. It concerns what we should store in our dataset.
I agree with your proposed solution, but I’m hesitant to make a single hard coded exception… any other issues you’ve found?
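One way to avoid a hard-coded exception is a general rule: when a scraped ID contains several variants separated by `|`, keep the one that carries a year. This is only a sketch of that idea; the helper name is made up, and it has been checked against nothing but the single example above.

```ruby
# Picks one identifier from a scraped string such as
# "ISO/CIE TR 21783:2022 | ISO/CIE TR 21783".
# Prefers the first variant with a ":YYYY" year; falls back to the first variant.
def normalize_scraped_id(raw)
  variants = raw.split("|").map(&:strip).reject(&:empty?)
  variants.find { |id| id.match?(/:\d{4}/) } || variants.first
end

puts normalize_scraped_id("ISO/CIE TR 21783:2022 | ISO/CIE TR 21783")
```

IDs without a `|` pass through unchanged, so the rule could be applied to every scraped ID rather than to one document.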
@ronaldtse I haven't parsed all the ISO documents yet, only 10k of about 60k. We'll see all the issues in the GHA log. For now, the issues are:
We use the default doctype international-standard for ISO docs, but there are IEC docs like IEC/IEEE 80005-1:2019. Should we use the same default doctype for the IEC docs?
There are docs with the DATA type, like ISO/DATA 3:1977, that Pubid::Iso fails to parse. I'll create an issue for @mico.
There are duplicated IDs, for example ISO/IEC 8825-6:2008. How should we handle them?
Default document type: yes.
DATA document types. Yes we can fix them later.
Duplicated IDs. This instance is when the project was cancelled and then re-started, which was then withdrawn. I think for “status: deleted” items we just ignore them for now.
@ronaldtse in the static dataset there are docs with an ISO/IEC DIR relation. Pubid::Iso fails to parse the ISO/IEC DIR ID, and there isn't an ISO/IEC DIR doc in the dataset. Is the relation correct? If the relation is correct, then we need to fix Pubid::Iso.
In this case let's remove the ISO/IEC DIR relation, because ISO/IEC DIR is analogous to ISO/IEC TR, which doesn't quite make sense as a citation target.
@ronaldtse At least one document isn't listed on the ICS pages: https://www.iso.org/standard/77374.html. The doc has deleted status, so I think we can ignore it, right?
@andrew2net I think we can ignore it, but it does contain ICS:
@andrew2net then probably because it was out of range. There is a page limit for ICS pages, right?
In any case, we probably should keep a list of all the project numbers (the 5-digit IDs) because ISO does not re-use them.
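The project number can be read off the document URLs seen in this thread. A small sketch, assuming the number is always the final path segment before `.html`; the helper name is illustrative:

```ruby
# Extracts the numeric project ID from an ISO document URL, or nil if the
# URL does not end in "/<digits>.html".
def project_number(url)
  m = url.match(%r{/(\d+)\.html\z})
  m && m[1].to_i
end

puts project_number("https://www.iso.org/standard/77374.html")
puts project_number("https://www.iso.org/cms/render/live/en/sites/isoorg/contents/data/standard/01/80/18070.html")
```

Both the short `/standard/` links and the long RSS links resolve to the same kind of number, so either can feed the list.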
@ronaldtse we can extract all the project numbers that we have in the relaton-data-iso repo from the index.
Then is the problem that this project was "deleted", and that's why it's not shown in the ICS pages?
implemented in v1.18.2
Due to the slow retrieval from ISO's site we have to build a static dataset.
The method to build the static dataset is through ISO's RSS feeds.
For each ICS code, they have an RSS feed:
An ICS RSS feed provides a list of standards like this:
Steps to retrieve standards
This way we can enumerate all published standards and the latest stages/dates.
Detecting updates
From the daily retrieval of ICS RSS feeds, we can detect if there are any changes to the documents as the RSS feeds provide the latest publication/stage dates. For items that have been updated, we can update using their individual page links.
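The comparison described above could look roughly like the following. It is a sketch under stated assumptions: the dataset is represented as a plain Hash of link to last known date, feed items are Hashes with `:link` and `:pub_date`, and the sample dates are illustrative rather than real ISO data.

```ruby
require "date"

# stored:     Hash of link => last known date (Date) from the static dataset.
# feed_items: Array of Hashes with :link and :pub_date (Date) from the RSS feeds.
# Returns the links that are new or whose feed date is later than the stored one.
def links_to_refresh(stored, feed_items)
  feed_items.select do |item|
    known = stored[item[:link]]
    known.nil? || item[:pub_date] > known
  end.map { |item| item[:link] }.uniq
end

stored = {
  "https://www.iso.org/standard/18070.html" => Date.new(2021, 6, 4),
}
feed_items = [
  { link: "https://www.iso.org/standard/18070.html", pub_date: Date.new(2021, 6, 4) },
  { link: "https://www.iso.org/standard/77130.html", pub_date: Date.new(2021, 6, 4) },
]
p links_to_refresh(stored, feed_items)
```

The `uniq` call is there because a document listed under several ICS codes can appear in more than one feed.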