themains / dmoz_csv

Convert DMOZ content.rdf.u8.gz into a CSV file

CSV output #2

Closed soodoku closed 6 years ago

soodoku commented 7 years ago

Protect strings with commas with quotes

Some domains have multiple category labels. Each label is separated by a comma. But these strings with commas are not always enclosed in quotes. For instance,

www.tanvald.cz,Top/World/Deutsch/Regional/Europa/Tschechien/Regionen/Reichenberg/Tanvald,Top/World/Česky/Státy_a_regiony/Evropa/Česká_republika/Kraje/Liberecký/Tanvald
www.wstyler.com,Top/Business/Mining_and_Drilling/Tools_and_Equipment/Mining,Top/Regional/North_America/United_States/Ohio/Localities/M/Mentor/Business_and_Economy/Manufacturing,Top/Business/Industrial_Goods_and_Services/Cable_and_Wire/Wire_Mesh

Suggested fix: always protect category labels with quotes
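A minimal sketch of that suggestion, assuming the converter writes its rows with Python's standard `csv` module (the thread does not say how the file is actually produced). The helper name `write_quoted` is hypothetical; the sample row is the www.wstyler.com example above.

```python
import csv
import io

# One sample row taken from the example above: a URL followed by its category labels.
rows = [
    [
        "www.wstyler.com",
        "Top/Business/Mining_and_Drilling/Tools_and_Equipment/Mining",
        "Top/Business/Industrial_Goods_and_Services/Cable_and_Wire/Wire_Mesh",
    ],
]

def write_quoted(rows, fileobj):
    """Write rows so that every field is wrapped in double quotes (hypothetical helper)."""
    # QUOTE_ALL quotes every field, whether or not it contains a comma.
    writer = csv.writer(fileobj, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerows(rows)

buffer = io.StringIO()
write_quoted(rows, buffer)
quoted_output = buffer.getvalue()
print(quoted_output)
```

With `csv.QUOTE_ALL` the quoting no longer depends on the content of each field, so the output is consistent from row to row.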

suriyan commented 7 years ago

Regarding the original output file format:

The structure of the file is

"URL","Category 1","Category 2",..........

Multiple categories are spread across separate columns. In this case, would you like them combined into one column?
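For illustration, a file in that layout could be read roughly as follows. This is only a sketch: it assumes the first column is always the URL and every remaining column is one category, and `read_categories` is a hypothetical helper, not part of the repository. The second sample row is made up.

```python
import csv
import io

# Sample input in the "URL","Category 1","Category 2",... layout.
# The www.example.org row is invented for illustration.
sample = (
    '"www.wstyler.com",'
    '"Top/Business/Mining_and_Drilling/Tools_and_Equipment/Mining",'
    '"Top/Business/Industrial_Goods_and_Services/Cable_and_Wire/Wire_Mesh"\n'
    '"www.example.org","Top/Arts"\n'
)

def read_categories(fileobj):
    """Map each URL to the list of categories found in the remaining columns."""
    result = {}
    for row in csv.reader(fileobj):
        if not row:
            continue  # skip blank lines
        url, categories = row[0], row[1:]
        result[url] = categories
    return result

parsed = read_categories(io.StringIO(sample))
for url, categories in parsed.items():
    print(url, len(categories))
```

Rows have a different number of columns depending on how many categories a URL has, so tools that expect a fixed column count may need extra handling.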

soodoku commented 7 years ago

Aah --- got it!

The surprising thing is that some strings with multiple categories are protected by quotes and some are not. Maybe there is a logic to it.

But it may make sense to quote every time we have multiple categories rather than put them into separate columns.

suriyan commented 7 years ago

I agree with you about putting multiple categories into one column. However, it seems a category label may itself contain a comma. So should we use "|" (pipe) as the category separator instead?

soodoku commented 7 years ago

Interesting. I didn't realize that a single category label can also contain a comma. If so, yeah, let's go with some fancy delimiter. Pipe or semicolon, whatever doesn't exist in the data. Thanks!
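A rough sketch of the approach agreed on above: check which candidate delimiter never occurs in any category label, then join each URL's categories into a single quoted column. The helper names (`pick_delimiter`, `write_joined`), the candidate list, and the sample records are all hypothetical; the second record stands in for a label that contains a comma.

```python
import csv
import io

CANDIDATES = ["|", ";", "\t"]  # assumed candidate delimiters, in order of preference

def pick_delimiter(records, candidates=CANDIDATES):
    """Return the first candidate that appears in no category label."""
    for candidate in candidates:
        if not any(candidate in label for _, labels in records for label in labels):
            return candidate
    raise ValueError("every candidate delimiter occurs in the data")

def write_joined(records, fileobj, delimiter):
    """Write one row per URL: the URL, then all categories joined by the delimiter."""
    writer = csv.writer(fileobj, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for url, labels in records:
        writer.writerow([url, delimiter.join(labels)])

# Made-up records; the second label contains a comma on purpose.
records = [
    ("www.example.com", ["Top/Business/Mining", "Top/Regional/Ohio"]),
    ("www.example.org", ["Top/Shopping/Tools,_Hardware"]),
]

delimiter = pick_delimiter(records)
buffer = io.StringIO()
write_joined(records, buffer, delimiter)
joined_output = buffer.getvalue()
print(joined_output)
```

Every row then has exactly two columns, and splitting the second column on the chosen delimiter recovers the individual labels, commas included.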