italia / publiccode-crawler

publiccode.yml crawler for the Open Source software catalog of Developers Italia
GNU Affero General Public License v3.0

Once in a while the crawler doesn't expand the URLs in logos #170

Closed bfabio closed 2 years ago

bfabio commented 4 years ago

When the path in logo is relative, the crawler is supposed to expand it to the full URL and export the normalized file, but sometimes it doesn't do it:

today:

_site/en/software/p_tn-provinciaautonomatrento-pitre.html
  *  internal image screenshots/pitre_logo.png does not exist (line 632)

August 16th:

- ./_site/en/software/p_ve-cittametropolitanavenezia-desk-kitriuso-sicla.html
  *  internal image logo_header2.fw_.png does not exist (line 632)
- ./_site/it/software/p_ve-cittametropolitanavenezia-desk-kitriuso-sicla.html
  *  internal image logo_header2.fw_.png does not exist (line 632)
bfabio commented 4 years ago

pcvalidator -export expands the URLs only if -remote-base-url is passed. This is documented in the source but not to the user.

@sebbalex is there a chance the crawler could run it with no or empty RemoteBaseURL?
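
For readers following along, here is a minimal sketch of the behaviour described in this comment: a relative path is expanded only when a base URL is supplied, and is otherwise passed through untouched. The function name `expandIfBaseSet` and the example URLs are assumptions made for illustration. This is not pcvalidator's or the crawler's actual code.

```go
package main

import "net/url"

// expandIfBaseSet is a hypothetical helper that mimics the behaviour
// described above: the path is resolved against remoteBaseURL only when
// a base is given. The boolean reports whether expansion took place, so
// a caller can notice a silent skip instead of exporting a relative path.
func expandIfBaseSet(remoteBaseURL, path string) (result string, expanded bool) {
	if remoteBaseURL == "" {
		// No base URL: the path is returned exactly as it was written.
		return path, false
	}
	base, err := url.Parse(remoteBaseURL)
	if err != nil {
		return path, false
	}
	ref, err := url.Parse(path)
	if err != nil {
		return path, false
	}
	// ResolveReference follows RFC 3986: a relative ref is joined to the
	// base, while an already absolute ref is returned unchanged.
	return base.ResolveReference(ref).String(), true
}
```

Calling `expandIfBaseSet("", "logo_header2.fw_.png")` leaves the path relative, which matches the "internal image ... does not exist" symptom reported above. Whether an empty base is what actually happened in the crawler is not established in this thread.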

sebbalex commented 4 years ago

> @sebbalex is there a chance the crawler could run it with no or empty RemoteBaseURL?

RemoteBaseURL is needed to enforce absolute and relative url validation:

> If left empty, absolute URLs will not be validated and no remote validation of files with relative paths will be performed.

Furthermore, there was no evidence of this in the past, and since no changes were made in our codebase, this confuses me.

bfabio commented 4 years ago

> RemoteBaseURL is needed to enforce absolute and relative url validation:

Sorry, I wasn't clear. I was trying to say that the crawler being run with an empty RemoteBaseURL for some exotic reason could explain what we are seeing. I'm just as puzzled as you. :thinking:

sebbalex commented 4 years ago

In the latest run I noticed these timeout problems. I think they are related to the URL expansion issue we have here.

time="2020-09-24T08:35:11Z" level=error msg="Error parsing publiccode.yml: logo: HTTP GET failed for https://raw.githubusercontent.com/AgID/rndt-joomla-template/master/documentation/images/logo-rndt.png: Get https://raw.githubusercontent.com/AgID/rndt-joomla-template/master/documentation/images/logo-rndt.png: dial tcp 151.101.36.133:443: i/o timeout"
time="2020-09-24T08:35:12Z" level=error msg="Error parsing publiccode.yml: logo: HTTP GET failed for https://raw.githubusercontent.com/AgID/rndt-catalogue/master/documentation/images/logo-rndt.png: Get https://raw.githubusercontent.com/AgID/rndt-catalogue/master/documentation/images/logo-rndt.png: dial tcp 151.101.36.133:443: i/o timeout"
time="2020-09-24T08:35:13Z" level=error msg="Error parsing publiccode.yml: logo: HTTP GET failed for https://raw.githubusercontent.com/italia/18app/master/src/Italia.DiciottoApp.iOS/Assets.xcassets/AppIcon.appiconset/Icon120.png: Get https://raw.githubusercontent.com/italia/18app/master/src/Italia.DiciottoApp.iOS/Assets.xcassets/AppIcon.appiconset/Icon120.png: dial tcp 151.101.36.133:443: i/o timeout"
time="2020-09-24T08:35:13Z" level=error msg="Error parsing publiccode.yml: description/it/screenshots: HTTP GET failed for https://raw.githubusercontent.com/consiglionazionaledellericerche/cool-jconon/master/docs/screenshot/responsive_it.png: Get https://raw.githubusercontent.com/consiglionazionaledellericerche/cool-jconon/master/docs/screenshot/responsive_it.png: dial tcp 151.101.36.133:443: i/o timeout\ndescription/en/screenshots: HTTP GET failed for https://raw.githubusercontent.com/consiglionazionaledellericerche/cool-jconon/master/docs/screenshot/home_en.png: Get https://raw.githubusercontent.com/consiglionazionaledellericerche/cool-jconon/master/docs/screenshot/home_en.png: dial tcp 151.101.36.133:443: i/o timeout"
time="2020-09-24T08:35:14Z" level=error msg="Error parsing publiccode.yml: description/it/screenshots: HTTP GET failed for https://raw.githubusercontent.com/vvfosprojects/sovvf/master/doc/images/dashboard.jpg: Get https://raw.githubusercontent.com/vvfosprojects/sovvf/master/doc/images/dashboard.jpg: dial tcp 151.101.36.133:443: i/o timeout"
time="2020-09-24T08:35:14Z" level=error msg="Error parsing publiccode.yml: description/it/screenshots: HTTP GET failed for https://raw.githubusercontent.com/IstitutoCentraleCatalogoUnicoBiblio/Nuovo-Opac-di-Polo-SBN/master/screenshots/nuovo_opac.png: Get https://raw.githubusercontent.com/IstitutoCentraleCatalogoUnicoBiblio/Nuovo-Opac-di-Polo-SBN/master/screenshots/nuovo_opac.png: dial tcp 151.101.36.133:443: i/o timeout"
bfabio commented 4 years ago

#189 dramatically decreases the frequency of this happening.

Keeping this open though, because the root issue is not resolved.

sebbalex commented 4 years ago

> #189 dramatically decreases the frequency of this happening.
>
> Keeping this open though, because the root issue is not resolved.

We could consider the root cause to be the number of concurrent processes and close this, wdyt @bfabio?

bfabio commented 4 years ago

@sebbalex I'm not convinced. There must be something wrong in the code that doesn't handle git failures correctly and still resolves the URL as relative. Most (all?) of the failures were caused by concurrency, but the crawler should have stopped processing the repo as soon as they happened.

bfabio commented 2 years ago

This doesn't apply anymore.

After #302 the crawler doesn't touch publiccode.yml's contents. API consumers are now in charge of doing the expansion, if they need it.