medialab / hyphe

Websites crawler with built-in exploration and control web interface
http://hyphe.medialab.sciences-po.fr/demo/
GNU Affero General Public License v3.0
328 stars, 59 forks

[IMPORT URLS] Simulate Webentity creation rule should take the matching rule with the longest prefix #363

Closed (boogheta closed 4 years ago)

boogheta commented 4 years ago

In some cases it does the opposite, for instance when we have the following creation rules:

- facebook.com: path-1
- facebook.com/pages: path-2
- facebook.com/groups: path-2
- facebook.com/people: path-2

and we try to import the following batch:

https://www.facebook.com/dude
https://www.facebook.com/dude/
https://www.facebook.com/people/dude
https://www.facebook.com/people/dude/
https://www.facebook.com/groups/dude
https://www.facebook.com/groups/dude/
https://www.facebook.com/pages/dude
https://www.facebook.com/pages/dude/
https://www.facebook.com/post/dude
https://www.facebook.com/post/dude/

It should propose all WEs completely to the right except for the last 2, but it only proposes path-1 for all of them.
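To make the expected behaviour concrete, here is a minimal Python sketch of longest-prefix rule selection. It is not Hyphe's actual code: the `RULES` table, `normalize` and `simulate_prefix` are hypothetical names, prefixes are plain host/path strings, and the leading `www.` is simply stripped, which is an assumption made for the illustration.

```python
from urllib.parse import urlsplit

# Hypothetical rule table mirroring the example above:
# rule prefix -> number of path segments kept in the proposed WE prefix
RULES = {
    "facebook.com": 1,
    "facebook.com/pages": 2,
    "facebook.com/groups": 2,
    "facebook.com/people": 2,
}

def normalize(url):
    """Return (host without a leading 'www.', list of non-empty path segments)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # assumption: www and non-www are treated alike
    segments = [s for s in parts.path.split("/") if s]
    return host, segments

def simulate_prefix(url, rules=RULES):
    """Apply the rule with the longest matching prefix and return the WE prefix."""
    host, segments = normalize(url)
    # Try candidate rule prefixes from the longest to the shortest
    for n in range(len(segments), -1, -1):
        candidate = "/".join([host] + segments[:n])
        if candidate in rules:
            depth = rules[candidate]
            return "/".join([host] + segments[:depth])
    return host  # no rule matched: fall back to the bare host
```

With this sketch, `https://www.facebook.com/people/dude/` matches the `facebook.com/people` rule and keeps its full path, while `https://www.facebook.com/post/dude` only matches `facebook.com` and is cut after `post`.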

This may come from calls to get_potential_prefix in the traph? cc @Yomguithereal

Yomguithereal commented 4 years ago

The traph is geared toward returning the longest prefix, but maybe something is not working as intended.

boogheta commented 4 years ago

Well, in fact the traph wasn't the culprit: it was due to the www variations of the default creation rules not being automatically created.
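As an illustration of that diagnosis, the sketch below expands a set of rules so that each prefix also exists in its www (or non-www) variation. This is only a sketch under assumptions: `with_www_variations` is a hypothetical helper, rules are represented as plain host/path strings mapped to a path depth, and none of this reflects how Hyphe actually stores or generates its creation rules.

```python
def with_www_variations(rules):
    """Return a copy of `rules` where each prefix also has its www/non-www twin.

    `rules` maps a "host/path" prefix to a path depth (hypothetical format).
    Rules that already exist are never overwritten.
    """
    expanded = dict(rules)
    for prefix, depth in rules.items():
        host, sep, path = prefix.partition("/")
        if host.startswith("www."):
            variant_host = host[4:]
        else:
            variant_host = "www." + host
        # setdefault keeps any rule explicitly defined for the variation
        expanded.setdefault(variant_host + sep + path, depth)
    return expanded
```

For example, a rule on `facebook.com/people` would gain a `www.facebook.com/people` twin with the same depth, so URLs on the www host can match the more specific rule.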