Closed littlepea closed 2 years ago
If a URL already has http://, should we keep it as it is?
For example: http://www.opsstarysacz.pl/
Note to self: I need a way to fix this edge case:
www.pcpr-jawor.https://sac.mpips.gov.pl:8443/Pomost/CKEditorServlet?TRYB=2
Empty URL
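One possible way to handle the fused edge case above, as a rough sketch only: keep everything from the last `http://` or `https://` marker onward. The helper name `extract_embedded_url` is hypothetical, and the thread does not say which part of such a value should survive, so treat the choice of "last marker wins" as an assumption.

```python
import re


def extract_embedded_url(raw):
    """Return the part of raw that starts at the last http(s):// marker.

    Hypothetical helper: values without any marker are returned unchanged.
    """
    matches = list(re.finditer(r"https?://", raw, flags=re.IGNORECASE))
    if not matches:
        return raw
    # Assumption: the text before the last marker is leftover junk.
    return raw[matches[-1].start():]
```

An empty string has no marker, so it passes through unchanged and can still be treated as an empty URL later.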
And how about we add a new point_formatter.UrlFormatter, as below?
    class ConvertSpreadsheetData(Injector):
        usecase = convert_data.ConvertSpreadsheetData
        settings = settings.Settings(_env_file="config/.env.example")
        log = print
        spreadsheet_reader = google_sheets.GoogleSheetsReader
        adapter = spreadsheet_adapter.SpreadsheetAdapter
        address_sanitizer = address_sanitizer.AddressSanitizer
        point_formatter = point_formatter.UrlFormatter  # <-- added one
        geocoder = geocoding.GeoCodingProcessor
        translator = translation.PointTranslator
        error_collector = error_collector.ErrorCollector
        validator = composite_validator.CompositeValidator
        validators = [RequiredFieldsValidator(), CategoriesValidator()]
        csv_repository = csv.CSVRepository
If a URL already has http://, should we keep it as it is?
Yes, I think we can keep it as is
And how about we add a new point_formatter.UrlFormatter as below?
@idisblueflash no need, just add a new validator to the PointOfInterest model with pre=True (we already have a few) and fix it right there
Some feedback about the Poland scraping data:
Note that some other URLs look odd. Look at rows: [216, 243, 318, 419, 506, 729, 760, 782, 872, 959, 1221, 1446, 1638, 2160, 2388, 2513] (row numbers might be off by one)
@littlepea I'd like to confirm:
For www.klwow, should we just clean it as ``, or add a tag like `Error URL: www.klwow`?
What about entries with multiple URLs, such as 'http://gops.mielec.pl/ ; http://www.gops.ug.mielec.pl/'?
Notes for issues above:
For rows [729, 760, 1221, 1446, 2160, 2513], are they really empty, or am I missing the URL data?
If they are empty, that's fine, then we just leave them empty
How should we handle case 506, www.klwow: just clean it as ``, or add a tag like `Error URL: www.klwow`?
just clean it as ``
How about selecting the first URL when we get multiple? 'http://gops.mielec.pl/ ; http://www.gops.ug.mielec.pl/'
Agree
I suggest cleaning email data in the URL field to `` as well
Agree
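The rules agreed above could be collected in one small function, for instance one that a `pre=True` validator on the model delegates to. This is only a sketch: the name `clean_url`, both regular expressions, and the "host needs a dot after an optional www." check are assumptions made for illustration, not existing project code.

```python
import re

# Hypothetical patterns; deliberately loose, not full RFC validation.
_EMAIL_RE = re.compile(r"^[^@\s/]+@[^@\s/]+\.[^@\s/]+$")
_DOMAIN_RE = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}$", re.IGNORECASE)


def clean_url(raw):
    """Apply the rules agreed in this thread to one raw URL cell."""
    if not raw or not raw.strip():
        return ""  # empty cells stay empty
    # Several URLs separated by ';': keep only the first one.
    first = raw.split(";")[0].strip()
    if _EMAIL_RE.match(first):
        return ""  # email data in the URL field is cleaned to ``
    # Drop an optional scheme, then isolate the host part.
    without_scheme = re.sub(r"^https?://", "", first, flags=re.IGNORECASE)
    host = re.split(r"[/:?#]", without_scheme, maxsplit=1)[0]
    # Assumption: after an optional "www." the host must still contain a dot,
    # which rejects values such as "www.klwow".
    bare = host[4:] if host.lower().startswith("www.") else host
    if not _DOMAIN_RE.match(bare):
        return ""
    return first  # URLs that already have http:// are kept as they are
```

With these assumptions, `clean_url("www.klwow")` yields an empty string, and the two-URL example from row data yields `http://gops.mielec.pl/`.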
The URL is not in a standard format. For example: "url": "www.pcpr-limanowa.pl". It is missing the "https://" prefix.
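A minimal sketch of adding the missing prefix, assuming a hypothetical helper named `ensure_scheme`. Defaulting to `https` follows the comment above, but whether every listed site actually serves HTTPS is not something this thread confirms, so the scheme is left as a parameter.

```python
def ensure_scheme(url, scheme="https"):
    """Prefix a bare URL with a scheme (hypothetical helper).

    Empty values stay empty, and values that already start with
    http:// or https:// are returned unchanged.
    """
    url = (url or "").strip()
    if not url:
        return ""
    if url.lower().startswith(("http://", "https://")):
        return url
    return f"{scheme}://{url}"
```

For example, `ensure_scheme("www.pcpr-limanowa.pl")` gives `https://www.pcpr-limanowa.pl`, while `http://www.opsstarysacz.pl/` is left as it is.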