soprasteria / cybersecurity-dfm

Data Feed Manager (news watch orchestrator to predict topic with deepdetect and store cleaned text in elasticsearch)
GNU General Public License v3.0

URI hash collision #12

Open BestPig opened 5 years ago

BestPig commented 5 years ago

There is a collision when generating the MD5 hash of a URI when the website uses the query string to pass parameters.

Example of a parsed "Cochonnet" youtube video.

>>> urlparse.urlparse(text_to_string('https://www.youtube.com/watch?v=30Nv0WY4Lg8'))
ParseResult(scheme='https', netloc='www.youtube.com', path='/watch', params='', query='v=30Nv0WY4Lg8', fragment='')

If the URI contains //, the URI that gets hashed is rebuilt as obj_uri.scheme + "://" + obj_uri.netloc + obj_uri.path. But YouTube passes the video id in the query string (query='v=30Nv0WY4Lg8' above), so the MD5 generated for every YouTube video is exactly the same, because the query is not taken into account.
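A minimal sketch to reproduce the collision, assuming the hash is a plain MD5 of the rebuilt string. The helper name `uri_hash` and the second video id are made up for illustration, and the import uses Python 3's `urllib.parse` rather than the Python 2 `urlparse` module shown above.

```python
import hashlib
from urllib.parse import urlparse  # Python 3 location of urlparse


def uri_hash(uri):
    """Hypothetical helper: MD5 of scheme + netloc + path, query ignored."""
    obj_uri = urlparse(uri)
    rebuilt = obj_uri.scheme + "://" + obj_uri.netloc + obj_uri.path
    return hashlib.md5(rebuilt.encode("utf-8")).hexdigest()


first = uri_hash("https://www.youtube.com/watch?v=30Nv0WY4Lg8")
second = uri_hash("https://www.youtube.com/watch?v=AAAAAAAAAAA")  # made-up video id

# Both URIs are reduced to https://www.youtube.com/watch before hashing.
print(first == second)  # True
```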

Here are all the lines where I found the bug: https://github.com/soprasteria/cybersecurity-dfm/blob/03bd533ea4ba43328f88f027cf81ac676022daa8/utils/dfmtelegrambot.py#L53

https://github.com/soprasteria/cybersecurity-dfm/blob/9b78c22d71f5178d3320582d8429fa37a8dedda7/dfm/server.py#L90

https://github.com/soprasteria/cybersecurity-dfm/blob/03bd533ea4ba43328f88f027cf81ac676022daa8/dfm/storage.py#L181

acabrol commented 5 years ago

Parameters in the URL are removed to avoid duplicate news items caused by token or origin parameters that differ for each submission.
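A rough illustration of that deduplication, assuming the same MD5-of-rebuilt-URI scheme. The helper name `uri_hash`, the example.com URLs, and the `utm_source` / `token` parameters are placeholders, not taken from the project.

```python
import hashlib
from urllib.parse import urlparse  # Python 3 location of urlparse


def uri_hash(uri):
    """Hypothetical helper: MD5 of scheme + netloc + path, query ignored."""
    obj_uri = urlparse(uri)
    rebuilt = obj_uri.scheme + "://" + obj_uri.netloc + obj_uri.path
    return hashlib.md5(rebuilt.encode("utf-8")).hexdigest()


# The same article submitted twice with different tracking parameters.
from_feed = uri_hash("https://news.example.com/article/42?utm_source=twitter")
from_chat = uri_hash("https://news.example.com/article/42?token=abc123")

# Dropping the query makes both submissions map to one document id.
print(from_feed == from_chat)  # True
```

The same stripping is what merges sites that identify their content through the query string, which is the case reported in this issue.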