safe-refuge / safeway-data

Data mining tools for the Safeway app

Normalize URLs in PointOfInterest #42

Closed littlepea closed 2 years ago

littlepea commented 2 years ago

The URL is not in a standard format. For example: "url": "www.pcpr-limanowa.pl". It is missing the "https://" prefix.

idisblueflash commented 2 years ago

For URLs that already have http://, should we keep them as they are? For example, http://www.opsstarysacz.pl/

idisblueflash commented 2 years ago

Note to self: need a way to fix this edge case:

idisblueflash commented 2 years ago

And how about we add a new point_formatter.UrlFormatter as below?

class ConvertSpreadsheetData(Injector):
    usecase = convert_data.ConvertSpreadsheetData
    settings = settings.Settings(_env_file="config/.env.example")
    log = print
    spreadsheet_reader = google_sheets.GoogleSheetsReader
    adapter = spreadsheet_adapter.SpreadsheetAdapter
    address_sanitizer = address_sanitizer.AddressSanitizer
    point_formatter = point_formatter.UrlFormatter  # <-- newly added
    geocoder = geocoding.GeoCodingProcessor
    translator = translation.PointTranslator
    error_collector = error_collector.ErrorCollector
    validator = composite_validator.CompositeValidator
    validators = [RequiredFieldsValidator(), CategoriesValidator()]
    csv_repository = csv.CSVRepository

littlepea commented 2 years ago

For URLs that already have http://, should we keep them as they are?

Yes, I think we can keep it as is

littlepea commented 2 years ago

And how about we add a new point_formatter.UrlFormatter as below?

@idisblueflash no need, just add a new validator to the PointOfInterest model with pre=True (we already have a few) and fix it right there
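
A rough sketch of the fix logic such a pre-validator could delegate to, assuming the rule agreed above (add "https://" when no scheme is present, leave "http://" URLs alone). The function name `add_default_scheme` is hypothetical and not taken from the repository; the comment shows how it might be wired in if the model uses pydantic v1-style validators.

```python
def add_default_scheme(url: str) -> str:
    """Prefix scheme-less URLs with https://; keep http:// and https:// as is."""
    url = url.strip()
    if not url:
        # Empty values stay empty.
        return url
    if url.lower().startswith(("http://", "https://")):
        # Already has a scheme: keep it unchanged, as agreed in the thread.
        return url
    return f"https://{url}"


# Possible wiring inside the model (assumption, pydantic v1 style):
#
#     @validator("url", pre=True)
#     def normalize_url(cls, value):
#         return add_default_scheme(value) if isinstance(value, str) else value
```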

littlepea commented 2 years ago

Some feedback about the Poland scraping data:

Note that some other URLs look odd. See rows: [216, 243, 318, 419, 506, 729, 760, 782, 872, 959, 1221, 1446, 1638, 2160, 2388, 2513] (row numbers may be off by one)

idisblueflash commented 2 years ago

@littlepea I'd like to confirm:

  1. For rows [729, 760, 1221, 1446, 2160, 2513], are they really empty, or am I missing the URL data?
  2. How should we handle case 506, www.klwow: just clean it to ``, or add a tag such as `Error URL: www.klwow`?
  3. How about selecting the first URL when we get multiple? 'http://gops.mielec.pl/ ; http://www.gops.ug.mielec.pl/'

Notes for issues above:

littlepea commented 2 years ago

For rows [729, 760, 1221, 1446, 2160, 2513], are they really empty, or am I missing the URL data?

If they are empty, that's fine, then we just leave them empty

How should we handle case 506, www.klwow: just clean it to ``, or add a tag such as `Error URL: www.klwow`?

just clean it as ``

How about selecting the first URL when we get multiple? 'http://gops.mielec.pl/ ; http://www.gops.ug.mielec.pl/'

Agree
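
The three answers above could be combined roughly as follows. This is only a sketch: `pick_first_valid_url` is a hypothetical name, and the check that a host needs at least a domain and a TLD after an optional "www" is an assumed heuristic for catching truncated values such as www.klwow, not something decided in this thread.

```python
from urllib.parse import urlsplit


def pick_first_valid_url(raw: str) -> str:
    """Keep the first ';'-separated URL, or return '' if it looks incomplete."""
    first = raw.split(";")[0].strip()
    if not first:
        # Genuinely empty cells are left empty.
        return ""
    # urlsplit only fills in the host when a scheme is present,
    # so add one temporarily for the check.
    candidate = first if "://" in first else f"https://{first}"
    try:
        host = urlsplit(candidate).hostname or ""
    except ValueError:
        return ""
    labels = host.split(".")
    if labels[0] == "www":
        labels = labels[1:]
    # Assumed heuristic: require at least "domain.tld" with no empty labels.
    if len(labels) < 2 or not all(labels):
        return ""
    return first
```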

idisblueflash commented 2 years ago

How should we handle case 506, www.klwow: just clean it to ``, or add a tag such as `Error URL: www.klwow`?

just clean it as ``

I suggest cleaning email addresses found in the URL field to `` as well

FYI, here are some errors I found:

Image

littlepea commented 2 years ago

I suggest cleaning email addresses found in the URL field to `` as well

Agree
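
One possible way to detect an email address sitting in the URL field is sketched below. The name `blank_if_email` and the deliberately loose pattern are assumptions for illustration, not the project's actual rule; the pattern only asks for "something@something.something" with no slashes or spaces.

```python
import re

# Loose, assumed pattern: local part, "@", then a dotted domain.
EMAIL_RE = re.compile(r"^[^@\s/]+@[^@\s/]+\.[^@\s/]+$")


def blank_if_email(value: str) -> str:
    """Return '' when the value is an email address, otherwise the stripped value."""
    value = value.strip()
    candidate = value
    if value.lower().startswith("mailto:"):
        # Treat mailto: links as email addresses too.
        candidate = value[len("mailto:"):]
    return "" if EMAIL_RE.match(candidate) else value
```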