john-kurkowski / tldextract

Accurately separates a URL’s subdomain, domain, and public suffix, using the Public Suffix List (PSL).
BSD 3-Clause "New" or "Revised" License
1.81k stars 211 forks

v5.0.0 tldextract.extract not working like v4.0.0 on pandas dataframe #307

jeffreyorourke closed 8 months ago

jeffreyorourke commented 8 months ago

import pandas as pd
import tldextract

data = ["https://url1.com", "http://url2.com", "url3.com"]
df = pd.DataFrame(data, columns=['urls'])
df['extracted'] = df.urls.apply(lambda x: '.'.join(tldextract.extract(x)[1:3]))  # add column with top level domain url
df

in v4.0.0 this produces:

index  urls              extracted
0      https://url1.com  url1.com
1      http://url2.com   url2.com
2      url3.com          url3.com

in v5.0.0 this results in the following error:


TypeError                                 Traceback (most recent call last)
in <cell line: 6>()
      4 data = ["https://url1.com/","http://url2.com/","url3.com"]
      5 df = pd.DataFrame(data, columns=['urls'])
----> 6 df['extracted'] = df.urls.apply(lambda x: '.'.join(tldextract.extract(x)[1:3])) # add column with top level domain url
      7 df

4 frames
/usr/local/lib/python3.10/dist-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwargs)
   4769         dtype: float64
   4770         """
-> 4771         return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
   4772
   4773     def _reduce(

/usr/local/lib/python3.10/dist-packages/pandas/core/apply.py in apply(self)
   1121
   1122         # self.f is Callable
-> 1123         return self.apply_standard()
   1124
   1125     def agg(self):

/usr/local/lib/python3.10/dist-packages/pandas/core/apply.py in apply_standard(self)
   1172             else:
   1173                 values = obj.astype(object)._values
-> 1174                 mapped = lib.map_infer(
   1175                     values,
   1176                     f,

/usr/local/lib/python3.10/dist-packages/pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

in (x)
      4 data = ["https://url1.com/","http://url2.com/","url3.com"]
      5 df = pd.DataFrame(data, columns=['urls'])
----> 6 df['extracted'] = df.urls.apply(lambda x: '.'.join(tldextract.extract(x)[1:3])) # add column with top level domain url
      7 df

TypeError: 'ExtractResult' object is not subscriptable

john-kurkowski commented 8 months ago

TypeError: 'ExtractResult' object is not subscriptable

Yes, that is expected in 5.0.0. See the breaking changes in the changelog.

In your case, you might want something like the following.

def my_extract(url: str) -> str:
    # Read the parts by attribute name; slicing the result raises the TypeError above in 5.0.0.
    ext = tldextract.extract(url)
    return '.'.join((ext.domain, ext.suffix))

df['extracted'] = df.urls.apply(my_extract)
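
For anyone wondering why the slice stopped working, here is a minimal sketch of the difference. It does not import tldextract. `TupleResult` and `DataclassResult` are hypothetical stand-ins, and the sketch assumes (as I read the 5.0.0 changelog) that `ExtractResult` went from a named tuple to a dataclass. A named tuple supports `[1:3]`; a plain dataclass does not, so attribute access is the form that works on both.

```python
from dataclasses import dataclass
from typing import NamedTuple


class TupleResult(NamedTuple):
    """Hypothetical stand-in for a tuple-style result (assumed v4 behaviour)."""
    subdomain: str
    domain: str
    suffix: str


@dataclass(frozen=True)
class DataclassResult:
    """Hypothetical stand-in for a dataclass-style result (assumed v5 behaviour)."""
    subdomain: str
    domain: str
    suffix: str


def join_by_index(result) -> str:
    # Positional slicing: only works when the object implements __getitem__.
    return ".".join(result[1:3])


def join_by_attribute(result) -> str:
    # Attribute access: works for both named tuples and dataclasses.
    return ".".join((result.domain, result.suffix))


old_style = TupleResult("", "url1", "com")
new_style = DataclassResult("", "url1", "com")

print(join_by_index(old_style))      # slicing a named tuple is fine
print(join_by_attribute(new_style))  # attribute access is fine on the dataclass

try:
    join_by_index(new_style)
except TypeError as exc:
    # Same kind of error as in the traceback above.
    print(f"TypeError: {exc}")
```

Switching to attribute access is also the safer choice for code that has to run on both 4.x and 5.x, since `ext.domain` and `ext.suffix` are used the same way in either case.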