tobie / ua-parser

A multi-language port of Browserscope's user agent parser.

J2ME phone detection #126

Closed selwin closed 11 years ago

selwin commented 11 years ago

J2ME phones should probably be detected as "Generic Feature Phone". Currently, ua-parser returns None as the device family.


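# Assumes the Python port, with user_agent_parser imported under the name "parser".
from ua_parser import user_agent_parser as parser

user_agent_string = 'Opera/9.80 (J2ME/MIDP; Opera Mini/9.80 (J2ME/22.478; U; en) Presto/2.5.25 Version/10.54'

parser.Parse(user_agent_string)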

{
    "device": {
        "family": None
    },
    "os": {
        "family": "Other",
        "major": None,
        "minor": None,
        "patch": None,
        "patch_minor": None
    },
    "string": "Opera/9.80 (J2ME/MIDP; Opera Mini/9.80 (J2ME/22.478; U; en) Presto/2.5.25 Version/10.54",
    "user_agent": {
        "family": "Opera Mini",
        "major": "9",
        "minor": "80",
        "patch": None
    }
}

dmolsen commented 11 years ago

Interesting... the PHP lib returns it as a "Generic Smartphone", which, looking at the regexes, is correct. Maybe this is a library-specific bug?

selwin commented 11 years ago

I'm seeing this on the master branch. I'll try to look at this in more detail this weekend.

tobie commented 11 years ago

This is because some of the regexes are case sensitive while others aren't.
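To illustrate with the user agent from the report, here is a minimal sketch using Python's `re` module directly rather than ua-parser itself. The pattern is the all-lowercase mobile regex quoted later in this thread. It finds `MIDP` only when case is ignored, which would explain why a port that matches case-insensitively reports a device while a case-sensitive one returns None.

```python
import re

UA = ("Opera/9.80 (J2ME/MIDP; Opera Mini/9.80 (J2ME/22.478; U; en) "
      "Presto/2.5.25 Version/10.54")

# All-lowercase mobile device pattern, as quoted in this thread.
PATTERN = (r"(hiptop|avantgo|plucker|xiino|blazer|elaine|up.browser|up.link|mmp|"
           r"smartphone|midp|wap|vodafone|o2|pocket|mobile|pda)")

case_sensitive = re.search(PATTERN, UA)                    # default: case matters
case_insensitive = re.search(PATTERN, UA, re.IGNORECASE)   # case ignored

print(case_sensitive)             # None
print(case_insensitive.group(1))  # MIDP
```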

dmolsen commented 11 years ago

That stuff came out of my original build, and I do case-insensitive matching, which is why it works with the PHP lib. In the case of MIDP, I can add the uppercase version to the regex. I can review the other smartphone choices as well and see if I can get the correct case for them. I'll dig up test cases.
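A rough sketch of what adding the uppercase version could look like if matching stays case-sensitive. The pattern below is a shortened, hypothetical stand-in for the real entry, and the two `Profile/...` strings are illustrative fragments rather than real user agents. It also shows the limit of this approach: any spelling that is not listed explicitly is still missed.

```python
import re

# Hypothetical shortened pattern: the uppercase spelling is listed explicitly
# so that plain case-sensitive matching finds it.
PATTERN = re.compile(r"(midp|MIDP|wap|pda)")

samples = [
    "Opera/9.80 (J2ME/MIDP; Opera Mini/9.80 (J2ME/22.478; U; en) "
    "Presto/2.5.25 Version/10.54",
    "Profile/midp-2.0",  # illustrative fragment, lowercase spelling
    "Profile/Midp-2.0",  # illustrative fragment, mixed-case spelling
]

# True where the case-sensitive pattern finds one of the listed spellings.
results = [bool(PATTERN.search(ua)) for ua in samples]
print(results)  # [True, True, False]
```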

tobie commented 11 years ago

I think we should all agree on one way of doing it, or else we're going to get lots of edge cases like this one. My preference would be to be case-sensitive all the way (and thus do case-sensitive matching inside the different ports too). @elsigh, thoughts?

dmolsen commented 11 years ago

All of the search engine regexes will have to be reviewed for case sensitivity too. I should be able to use some of the PHP command line features with my work Apache log files to quickly narrow down those issues. We'll just have to accept that the feature phone regex will be hit or miss, though maybe that can also be addressed with some healthy log parsing.
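One way such a log review could be approached, sketched here in Python rather than with the PHP command line: run each pattern over a set of user agents twice and keep the pairs that match only when case is ignored. The helper name `case_only_matches`, the shortened patterns, and the sample strings are all assumptions for illustration; real input would come from the access logs.

```python
import re

def case_only_matches(patterns, user_agents):
    """Yield (pattern, ua) pairs that match only when case is ignored."""
    for pattern in patterns:
        strict = re.compile(pattern)
        loose = re.compile(pattern, re.IGNORECASE)
        for ua in user_agents:
            if loose.search(ua) and not strict.search(ua):
                yield pattern, ua

# Hypothetical shortened patterns and sample user agents.
patterns = [r"(bot|slurp|spider)", r"(midp|wap|pda)"]
user_agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1)",  # lowercase "bot": strict match
    "Mozilla/5.0 (compatible; Yahoo! Slurp)",   # capital "Slurp": loose only
    "Opera/9.80 (J2ME/MIDP; Opera Mini/9.80 (J2ME/22.478; U; en) "
    "Presto/2.5.25 Version/10.54",              # "MIDP": loose only
]

hits = list(case_only_matches(patterns, user_agents))
for pattern, ua in hits:
    print(pattern, "->", ua)
```

Each reported pair points at a regex that behaves differently between case-sensitive and case-insensitive ports for that input.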

tobie commented 11 years ago

Well, either way, we should go down one road and fix the rest of the regexes to comply, don't you think?

tobie commented 11 years ago

Looked into this a bit more. Here are the regexp strings (with matching line numbers) which don't contain mixed casing (so they could have been written with case-insensitive matching in mind). IMHO the first ones are false positives:

 77 '(luakit)'
 93 'rekonq'
102 '(konqueror)/(\d+)\.(\d+)\.(\d+)'
129 '(chromeframe)/(\d+)\.(\d+)\.(\d+)'
155 '(facebookexternalhit)/(\d+)\.(\d+)'
335 '(python-requests)/(\d+)\.(\d+)'
825 '(hiptop|avantgo|plucker|xiino|blazer|elaine|up.browser|up.link|mmp|smartphone|midp|wap|vodafone|o2|pocket|mobile|pda)'
831 '^(1207|3gso|4thp|501i|502i|503i|504i|505i|506i|6310|6590|770s|802s|a wa|acer|acs\-|airn|alav|asus|attw|au\-m|aur |aus |abac|acoo|aiko|alco|alca|amoi|anex|anny|anyw|aptu|arch|argo|bell|bird|bw\-n|bw\-u|beck|benq|bilb|blac|c55/|cdm\-|chtm|capi|comp|cond|craw|dall|dbte|dc\-s|dica|ds\-d|ds12|dait|devi|dmob|doco|dopo|el49|erk0|esl8|ez40|ez60|ez70|ezos|ezze|elai|emul|eric|ezwa|fake|fly\-|fly_|g\-mo|g1 u|g560|gf\-5|grun|gene|go.w|good|grad|hcit|hd\-m|hd\-p|hd\-t|hei\-|hp i|hpip|hs\-c|htc |htc\-|htca|htcg)'
833 '^(htcp|htcs|htct|htc_|haie|hita|huaw|hutc|i\-20|i\-go|i\-ma|i230|iac|iac\-|iac/|ig01|im1k|inno|iris|jata|java|kddi|kgt|kgt/|kpt |kwc\-|klon|lexi|lg g|lg\-a|lg\-b|lg\-c|lg\-d|lg\-f|lg\-g|lg\-k|lg\-l|lg\-m|lg\-o|lg\-p|lg\-s|lg\-t|lg\-u|lg\-w|lg/k|lg/l|lg/u|lg50|lg54|lge\-|lge/|lynx|leno|m1\-w|m3ga|m50/|maui|mc01|mc21|mcca|medi|meri|mio8|mioa|mo01|mo02|mode|modo|mot |mot\-|mt50|mtp1|mtv |mate|maxo|merc|mits|mobi|motv|mozz|n100|n101|n102|n202|n203|n300|n302|n500|n502|n505|n700|n701|n710|nec\-|nem\-|newg|neon)'
835 '^(netf|noki|nzph|o2 x|o2\-x|opwv|owg1|opti|oran|ot\-s|p800|pand|pg\-1|pg\-2|pg\-3|pg\-6|pg\-8|pg\-c|pg13|phil|pn\-2|pt\-g|palm|pana|pire|pock|pose|psio|qa\-a|qc\-2|qc\-3|qc\-5|qc\-7|qc07|qc12|qc21|qc32|qc60|qci\-|qwap|qtek|r380|r600|raks|rim9|rove|s55/|sage|sams|sc01|sch\-|scp\-|sdk/|se47|sec\-|sec0|sec1|semc|sgh\-|shar|sie\-|sk\-0|sl45|slid|smb3|smt5|sp01|sph\-|spv |spv\-|sy01|samm|sany|sava|scoo|send|siem|smar|smit|soft|sony|t\-mo|t218|t250|t600|t610|t618|tcl\-|tdg\-|telm|tim\-|ts70|tsm\-|tsm3|tsm5|tx\-9|tagt)'
837 '^(talk|teli|topl|tosh|up.b|upg1|utst|v400|v750|veri|vk\-v|vk40|vk50|vk52|vk53|vm40|vx98|virg|vite|voda|vulc|w3c |w3c\-|wapj|wapp|wapu|wapm|wig |wapi|wapr|wapv|wapy|wapa|waps|wapt|winc|winw|wonu|x700|xda2|xdag|yas\-|your|zte\-|zeto|aste|audi|avan|blaz|brew|brvw|bumb|ccwa|cell|cldc|cmd\-|dang|eml2|fetc|hipt|http|ibro|idea|ikom|ipaq|jbro|jemu|jigs|keji|kyoc|kyok|libw|m\-cr|midp|mmef|moto|mwbp|mywa|newt|nok6|o2im|pant|pdxg|play|pluc|port|prox|rozo|sama|seri|smal|symb|treo|upsi|vx52|vx53|vx60|vx61|vx70|vx80|vx81|vx83|vx85|wap\-|webc|whit|wmlb|xda\-|xda_)'
843 '(bot|borg|google(^tv)|yahoo|slurp|msnbot|msrbot|openbot|archiver|netresearch|lycos|scooter|altavista|teoma|gigabot|baiduspider|blitzbot|oegp|charlotte|furlbot|http%20client|polybot|htdig|ichiro|mogimogi|larbin|pompos|scrubby|searchsight|seekbot|semanticdiscovery|silk|snappy|speedy|spider|voila|vortex|voyager|zao|zeal|fast\-webcrawler|converacrawler|dataparksearch|findlinks)' 
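A list like the one above could be produced with a check along these lines. This is only a sketch: the helper name `is_all_lowercase` is made up, the sample patterns are shortened or hypothetical, and loading the patterns from the regex file is left out. Backslash escapes are stripped first so that classes such as `\d` or `\S` do not count as letters.

```python
import re

def is_all_lowercase(pattern):
    """True if the pattern contains letters and none of them is uppercase."""
    # Drop backslash escapes so only literal characters are inspected.
    literal = re.sub(r"\\.", "", pattern)
    has_letters = any(c.isalpha() for c in literal)
    return has_letters and literal == literal.lower()

# Sample patterns; the second one is a hypothetical mixed-case entry.
patterns = [
    r"(konqueror)/(\d+)\.(\d+)\.(\d+)",
    r"(Opera Mini)/(\d+)\.(\d+)",
    r"(bot|borg|yahoo|slurp)",
    r"(\d+)\.(\d+)",
]

flags = [is_all_lowercase(p) for p in patterns]
print(flags)  # [True, False, True, False]
```

As the thread notes, an all-lowercase pattern is only a candidate: some of them match strings that really are lowercase in the wild, so each hit still needs a manual look.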
tobie commented 11 years ago

I'm considering this strictly a regex case-sensitivity issue, and only the last 6 regexes should be fixed (unless @elsigh advises otherwise). I'm opening a new issue for this task (#134) and assigning it to @dmolsen.