matomo-org / matomo

Empowering People Ethically with the leading open source alternative to Google Analytics that gives you full control over your data. Matomo lets you easily collect data from websites & apps and visualise this data and extract insights. Privacy is built-in. Liberating Web Analytics. Star us on Github? +1. And we love Pull Requests!
https://matomo.org/
GNU General Public License v3.0

Add possibility to configure simple wildcards in site URLs #16484

Open anthosz opened 4 years ago

anthosz commented 4 years ago

Hello,

I have an issue when I use import_logs.py with the option "Only track visits and actions when the action URL starts with one of the above URLs." checked and a site URL like example.com/! (for a URL shortener). My goal is to create a website with a report for all URLs/pages starting with "/!*".

Example URL (also tried with https and with a * at the end): http://example.com/!

Scenario 1 (doesn't work):
Enabled: Only track visits and actions when the action URL starts with one of the above URLs.
Log: example.com X.X.X.X [21/Sep/2020:14:30:01 +0200] "GET /!abcd" 200
Command: ./import_logs.py --idsite=1 --url='http://example.com/piwik/' --recorders=3 --log-format-regex="(?P<host>\S+) (?P<ip>\S+) \[(?P<date>.*?) (?P<timezone>.*?)\] \"\S+ (?P<path>.*?)\" (?P<status>\S+)" access.log
Result: Nothing new in log_link_visit_action table

Scenario 2 (works):
Disabled: Only track visits and actions when the action URL starts with one of the above URLs.
Log: example.com X.X.X.X [21/Sep/2020:14:30:01 +0200] "GET /!abcd" 200
Command: ./import_logs.py --idsite=1 --url='http://example.com/piwik/' --recorders=3 --log-format-regex="(?P<host>\S+) (?P<ip>\S+) \[(?P<date>.*?) (?P<timezone>.*?)\] \"\S+ (?P<path>.*?)\" (?P<status>\S+)" access.log
Result: New entry in log_link_visit_action table

Scenario 3 (works):
Disabled: Only track visits and actions when the action URL starts with one of the above URLs.
Log: example.com X.X.X.X [21/Sep/2020:14:30:01 +0200] "GET /!abcd" 200
Command: ./import_logs.py --idsite=1 --url='http://example.com/piwik/' --recorders=3 --log-format-regex="(?P<host>\S+) (?P<ip>\S+) \[(?P<date>.*?) (?P<timezone>.*?)\] \"\S+ (?P<path>.*?)\" (?P<status>\S+)" --hostname=example.com --include-path='/!*' access.log
Result: New entry in log_link_visit_action table (so it works if I force the path in import_logs.py but not in Matomo, which means I need to launch import_logs.py several times in this case)
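To make the scenarios easier to reproduce, here is a minimal Python sketch that applies the same --log-format-regex value to the sample log line, with the shell escaping (\") removed. It only shows which named groups the expression captures; the variable names are made up for illustration, and it is not a claim about how import_logs.py processes the line internally.

```python
import re

# The --log-format-regex value from the scenarios, without the shell escaping.
LOG_FORMAT = re.compile(
    r'(?P<host>\S+) (?P<ip>\S+) \[(?P<date>.*?) (?P<timezone>.*?)\] '
    r'"\S+ (?P<path>.*?)" (?P<status>\S+)'
)

# The sample log line used in all three scenarios.
line = 'example.com X.X.X.X [21/Sep/2020:14:30:01 +0200] "GET /!abcd" 200'

match = LOG_FORMAT.match(line)
fields = match.groupdict() if match else {}

# Print the captured groups that matter for site and path matching.
print(fields.get("host"), fields.get("path"), fields.get("status"))
```

The captured path is /!abcd, so the log parsing itself is not what drops the hit in Scenario 1.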

In this case, my goal is not to use a path separated by slash (/) but by exclamation mark "!".

If you need more information, don't hesitate to ask.

Thank you!

anthosz commented 4 years ago

It seems that, indeed, the slash is the only separator handled in https://github.com/matomo-org/matomo/blob/3.14.1/plugins/SitesManager/SiteUrls.php

Do you have something like a patch to allow other separators?

tsteur commented 4 years ago

@anthosz If I understand correctly, you want to match only paths that start with /!*, whereas Matomo currently only supports excluding URLs where the path is /!/*. Do I understand this right?

If I understand things correctly, this is kind of on purpose, since Matomo currently has no way to tell which behaviour someone expects.

anthosz commented 4 years ago

@tsteur Yes, that's what I would like: to have the possibility to also take "/!*" into account.

anthosz commented 4 years ago

A simple way could be to check whether import_url (the URL in the log or request) matches url as a plain prefix (instead of forcing url/), and if so, use this website. Also add an option that keeps this behavior disabled by default (so there is no impact on existing instances) and lets it be enabled on demand.

A bonus would be to allow regex in the URL (not related to this issue, but it can be useful if someone wants to use another separator, like "/(!|&)") ^^
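The proposal above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name url_matches_site and the flag allow_plain_prefix are hypothetical and do not exist in Matomo, the default branch is a simplified reading of the slash-boundary behaviour described in this thread, and scheme or host normalisation is ignored.

```python
def url_matches_site(action_url: str, site_url: str, allow_plain_prefix: bool = False) -> bool:
    """Return True if action_url is considered to belong to site_url.

    Hypothetical helper, not part of Matomo.
    """
    if allow_plain_prefix:
        # Opt-in behaviour: any URL that starts with the site URL matches.
        return action_url.startswith(site_url)

    # Default behaviour: the site URL must end at a slash boundary.
    base = site_url.rstrip("/")
    return action_url == base or action_url.startswith(base + "/")


site = "http://example.com/!"
hit = "http://example.com/!abcd"

print(url_matches_site(hit, site))                           # slash-boundary matching
print(url_matches_site(hit, site, allow_plain_prefix=True))  # plain prefix matching
```

With the flag off, the shortener URL is rejected because /!abcd does not continue with /!/; with the flag on, it is accepted. Keeping the flag off by default leaves existing instances unaffected.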

tsteur commented 4 years ago

Thanks @anthosz, I've updated the title to make it a bit clearer for us. Generally we would likely only be able to support some simple wildcards like * (if that's even possible), as I think we might sometimes be using the site URLs for other purposes too. To be checked.

Do I see this right that it might already help in your case if the include-path parameter in the log importer supported this (e.g. include-path='/!*')?
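For illustration, a "simple wildcard" restricted to * could be turned into a regular expression roughly like this. It is a sketch only: wildcard_to_regex and path_matches are hypothetical names, and nothing here describes how Matomo or import_logs.py evaluates such patterns.

```python
import re


def wildcard_to_regex(pattern):
    """Compile a pattern in which only '*' is special (matches any run of characters)."""
    # Escape every literal chunk so characters like '!' or '.' are matched verbatim.
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts))


def path_matches(path, pattern):
    """Return True if the whole path matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(path) is not None


print(path_matches("/!abcd", "/!*"))
print(path_matches("/blog/post", "/!*"))
```

Limiting the syntax to * keeps every other character literal, so existing site URLs that contain no * would match exactly as before.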

anthosz commented 4 years ago

@tsteur Yes and no. Currently it seems to work if we also specify the site ID, but in that case we need to run import_logs.py multiple times and it is slow (especially when we have more than 10 million log lines to parse and multiple websites).

Starker3 commented 2 years ago

We got another request for this feature today.

The user would like to be able to use a wildcard for subdomains, for example: https://*.example.org instead of having to specify every subdomain individually.
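As a rough illustration of what such a subdomain wildcard could mean, the Python sketch below compares only the scheme and host of a URL against a pattern like https://*.example.org. The helper name host_matches is hypothetical, this is not Matomo code, and fnmatchcase also gives special meaning to ? and [...], which a real implementation might not want.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def host_matches(url: str, site_pattern: str) -> bool:
    """Return True if url has the same scheme as site_pattern and its host fits the wildcard."""
    url_parts = urlsplit(url)
    pattern_parts = urlsplit(site_pattern)
    if url_parts.scheme != pattern_parts.scheme:
        return False
    # Host names are case-insensitive, so compare them in lower case.
    host = (url_parts.hostname or "").lower()
    host_pattern = (pattern_parts.hostname or "").lower()
    return fnmatchcase(host, host_pattern)


print(host_matches("https://shop.example.org/cart", "https://*.example.org"))
print(host_matches("https://example.org/", "https://*.example.org"))
```

Note that in this sketch *.example.org also matches nested subdomains such as a.b.example.org but not the bare example.org, which is one of the behaviours a real feature would have to define.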

mhh515 commented 2 years ago

Maybe even with the ability to use regular expressions, similar to the field "Excluded User Agents".

anthosz commented 2 years ago

be patient :)

Starker3 commented 1 year ago

We have another request from a Matomo user for this feature today.

ptemmer commented 11 months ago

Hey. Was support for regular expressions for website URLs added? My colleague assures me this used to work, but I can't seem to get it going myself, so it would be nice if you could confirm.

Thanks