Closed: jimmycr closed this issue 4 years ago
I found the same question (closed) in Russian, with the conclusion that there is no ready-made algorithm for this reduction. But I think it's not that hard to do for a "level"-based setting:

1. Strip all domain-name levels equal to or higher than the selected level and save the stripped names to a separate file.
2. Remove duplicates in that file; you get a list of unique domain names (stripped of the n-th+ level).
3. Read that file line by line and check how many times each stripped domain appears in the original file. If you find more than one record, write a "regexp" rule; if you find only one record, write a "name" rule from the original (unstripped) list.
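The stripping and deduplication steps above can be sketched in a few lines (a minimal sketch, assuming GNU sed: its numeric-plus-`g` flag `s///Ng`, which replaces the N-th match and every later one, is a GNU extension; the sample domains are taken from the example below):

```shell
# Reverse each line so levels are counted from the right, drop the 4th+
# "label" match with GNU sed's Ng flag, reverse back, then deduplicate
# (awk prints a line only the first time it is seen).
LEVEL=4
printf '%s\n' \
  10148.engine.mobileapptracking.com \
  10402.engine.mobileapptracking.com \
  ads.2mdnsys.com |
  rev | sed "s/.[^.]*//${LEVEL}g" | rev | awk '!seen[$0]++'
# -> engine.mobileapptracking.com
#    ads.2mdnsys.com
```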
For example, original list:

10148.engine.mobileapptracking.com
10402.engine.mobileapptracking.com
10896.engine.mobileapptracking.com
13146.engine.mobileapptracking.com
13248.engine.mobileapptracking.com
14012.engine.mobileapptracking.com
15486.engine.mobileapptracking.com
ads.2mdnsys.com
cfa.2mdnsys.com
static.twinpine.adatrix.com
Stripped list (level 4 selected):

engine.mobileapptracking.com
engine.mobileapptracking.com
engine.mobileapptracking.com
engine.mobileapptracking.com
engine.mobileapptracking.com
engine.mobileapptracking.com
engine.mobileapptracking.com
ads.2mdnsys.com
cfa.2mdnsys.com
twinpine.adatrix.com
Stripped list with duplicates removed (level 4 selected), using the command `awk '!seen[$0]++' filename`:
engine.mobileapptracking.com
ads.2mdnsys.com
cfa.2mdnsys.com
twinpine.adatrix.com
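For reference, the awk one-liner deduplicates while preserving order: `seen[$0]++` is 0 (false) the first time a line appears, so `!seen[$0]++` prints only first occurrences. A tiny self-contained demonstration:

```shell
# Each distinct line is printed once, in its original order.
printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
# -> a
#    b
#    c
```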
Final list:

/ip dns static
add address=127.0.0.1 comment="ADBlock" disabled=no regexp="engine.mobileapptracking.com"
add address=127.0.0.1 comment="ADBlock" disabled=no name="ads.2mdnsys.com"
add address=127.0.0.1 comment="ADBlock" disabled=no name="cfa.2mdnsys.com"
add address=127.0.0.1 comment="ADBlock" disabled=no name="static.twinpine.adatrix.com"
I'm using regexp rules at home, and there is no need to write a full regular expression; just write the "common" part.
Edit: algorithm explained
Another algorithm (maybe easier; I will try to prepare a shell script):
Read the original file line by line, strip the selected domain levels, and look for duplicates of the stripped string in the original file. If duplicates exist, write a "regexp" rule (the stripped domain); otherwise write a "name" rule with the whole domain. Finally deduplicate the written file with `awk '!seen[$0]++' filename`, so there is always exactly one unique regexp rule per (stripped) domain.
So here is my shell script. It was tested with https://cdn.jsdelivr.net/gh/tarampampam/mikrotik-hosts-parser@master/.hosts/basic.txt
and https://adaway.org/hosts.txt
lists. It produced about 20% fewer DNS rules on the adaway hosts.txt file using parameter 4. You can save it as, for example, parse.sh
and run it with 2 parameters (validation of input parameters is not implemented!). Then run the script like:
./parse.sh ./hosts.txt 4
Where:
./hosts.txt - file you want to parse
4 - the level of the domain name to check; 4 means cut the 4th and higher levels of the domain name, counted from the right
Example for parameter "4":
10148.engine.mobileapptracking.com => engine.mobileapptracking.com
abcdef.10148.engine.mobileapptracking.com => engine.mobileapptracking.com
zxcvb.abcdef.10148.engine.mobileapptracking.com => engine.mobileapptracking.com
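Assuming GNU sed, the three examples above can be reproduced with the same reverse-strip-reverse trick the script uses (the level parameter 4 appears as the numeric part of sed's `Ng` flag):

```shell
# Reversing the line lets the sed expression count labels from the right;
# s/.[^.]*//4g removes the 4th and all later label matches.
printf '%s\n' \
  10148.engine.mobileapptracking.com \
  abcdef.10148.engine.mobileapptracking.com \
  zxcvb.abcdef.10148.engine.mobileapptracking.com |
  rev | sed "s/.[^.]*//4g" | rev
# all three lines become: engine.mobileapptracking.com
```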
The script might not be perfect; I did it for fun and tried my best ;)
```shell
#!/bin/sh
# Usage: ./parse.sh <hosts file> <level>
sed -i -E "s/(127\.0\.0\.1|::1) //g" $1          # strip the "127.0.0.1 " and "::1 " prefixes (-E for the alternation)
sed -i "/^$/d;/^#/d;/^ /d" $1                    # remove comments, empty lines and lines starting with a space
rev $1 > ./reverse.tmp                           # reverse every domain name - every line in the file
sed -i "s/.[^.]*//$2g" ./reverse.tmp             # remove the n-th+ domain parts - parameter 2 of the script
rev ./reverse.tmp > ./filtered.tmp               # reverse the trimmed domain names back to normal
awk '!seen[$0]++' ./filtered.tmp > ./unique.tmp  # remove duplicate lines

printf "/ip dns static\n" > ./ADblock.rsc        # let's start writing our script to the file ADblock.rsc :)
while IFS="" read -r multi || [ -n "$multi" ]    # read every line of unique addresses in unique.tmp
do
  line=$(grep -m1 "$multi" ./$1)                         # whole original line for the "name" rule (no -o, which would return only the stripped part)
  count=$(grep --only-matching "$multi" ./$1 | wc -l)    # also count the occurrences
  domain=$(echo "${multi}" | awk -F"." '{print NF-1}')   # count "." in the unique address
  domain=$(($domain+2))                                  # add 2 so we get the "domain level" of the address
  if [ $count -gt "1" ] && [ $domain -eq "$2" ] ; then   # more than one occurrence, and we are checking the desired level
    printf "add address=127.0.0.1 comment=\"ADBlock\" disabled=no regexp=\"$multi\"\n" >> ./ADblock.rsc   # write a regexp rule into ADblock.rsc
  else
    printf "add address=127.0.0.1 comment=\"ADBlock\" disabled=no name=\"$line\"\n" >> ./ADblock.rsc      # otherwise write a name rule into ADblock.rsc
  fi
done < unique.tmp
rm -f ./*.tmp   # cleanup of temp files
rm -f $1        # cleanup of the input file (note: it was already modified in place by sed -i)
```
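One detail worth spelling out: the script derives the "level" of a stripped entry as (number of dots) + 2, i.e. the level the original name had before one label was cut, which is what gets compared against the script's level parameter. A tiny sketch of just that computation (variable names mirror the script's):

```shell
# engine.mobileapptracking.com has 2 dots; +2 gives "level 4",
# matching a ./parse.sh run with parameter 4.
multi="engine.mobileapptracking.com"
dots=$(echo "$multi" | awk -F"." '{print NF-1}')
level=$((dots + 2))
echo "$level"
# -> 4
```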
Is your feature request related to a problem?
There are a lot of records which differ at the "lowest-level" domain name only. For example, here the difference is at the 4th level only:

10148.engine.mobileapptracking.com
10402.engine.mobileapptracking.com
10896.engine.mobileapptracking.com
13146.engine.mobileapptracking.com
13248.engine.mobileapptracking.com
14012.engine.mobileapptracking.com
15486.engine.mobileapptracking.com
...
Describe the solution you'd like
Would it be possible to add a global "checkbox" (use/don't use) and a global "numeric up/down" selector (strip the n-th level and up) choosing from which domain-name level to merge these records, so that the result for level 4 and up is just one record like this (using a REGEXP rule):
add address=127.0.0.1 comment="ADBlock" disabled=no regexp="engine.mobileapptracking.com"
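One caveat worth noting: in the rule above, the dots are regex metacharacters and the pattern is unanchored, so it also matches names where the dot position holds some other character. If stricter matching were wanted, an escaped and anchored variant might look like the following (an untested sketch; it assumes RouterOS's doubled-backslash escaping inside quoted strings and a `\$` end anchor, and it matches only true subdomains, not the bare stripped name itself):

```
/ip dns static
add address=127.0.0.1 comment="ADBlock" disabled=no regexp=".*\\.engine\\.mobileapptracking\\.com\$"
```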
Additional context
I hope this can significantly reduce the "rules count" for MikroTik.
Also: how about a translation of the web interface to English?
Thanks
Jan