Open alexey-milovidov opened 4 years ago
The omnisci_server process is using 100% of a single CPU core; according to perf top, it is spending the time in LLVM-generated code.
Hardware: Xeon E5-2650v2 (32 logical cores), 128 GiB RAM, 40 TB HDD in mdRAID-5.
Hi @alexey-milovidov,
I don't see why I would load a TSV file into ClickHouse and then export it to a CSV file when I can load it directly into the OmniSci database, just changing the separator from the default to the tab.
This is a link to the docs for the copy command
https://docs.omnisci.com/loading-and-exporting-data/command-line/export-data
Besides that, I will check what's wrong with this query ASAP; it's not a query where our database shines, but it shouldn't take forever.
just changing the separator from the default to the tab
It would not work: we need to convert unsigned numbers to signed (OmniSci does not support BIGINT UNSIGNED) and turn string fields into valid UTF-8 (OmniSci does not support BLOB).
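To illustrate the two conversions mentioned above, here is a minimal Python sketch. It is not the query used in this issue, and the helper names `to_signed_64` and `to_valid_utf8` are hypothetical. It assumes that unsigned 64-bit values are reinterpreted as signed via two's-complement wrapping and that invalid byte sequences are replaced with U+FFFD; other policies (for example, rejecting such rows) are equally possible.

```python
def to_signed_64(value: int) -> int:
    """Reinterpret an unsigned 64-bit integer as a signed one (two's complement)."""
    if not 0 <= value < 2**64:
        raise ValueError(f"not an unsigned 64-bit value: {value}")
    # Values at or above 2**63 do not fit in a signed BIGINT, so wrap them around.
    return value - 2**64 if value >= 2**63 else value


def to_valid_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing every invalid sequence with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # The largest unsigned 64-bit value wraps around to -1.
    print(to_signed_64(2**64 - 1))
    # Valid UTF-8 passes through unchanged; the stray 0xFF byte is replaced.
    print(to_valid_utf8(b"caf\xc3\xa9 \xff"))
```

Note that wrapping changes the numeric value of large IDs, so it only suits columns used as opaque identifiers rather than as quantities.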
Hi @alexey-milovidov,
I just loaded the table with some fields using a different datatype (basically, all ID fields have been changed from BIGINT/INTEGER to encoded TEXT).
I populated the table with the original TSV downloaded from ClickHouse's tutorial, loading it multiple times to get 106M records in the table, using this COPY command:
copy hits_v1 from '/opt/root_ubuntu18/opt/opendata/hits_visit/hits_v1.tsv.xz' with (header='false', delimiter='\t', array_marker='[]', quoted='false');
The query wall time is:
- with 2 GPUs: between 390 and 410 ms
- with an AMD Threadripper 1920X CPU (12c/24t): between 2910 and 2960 ms (3 cores used)
- with an AMD Threadripper 1920X CPU (12c/24t) and a data-balanced (sharded) table: between 1790 and 1850 ms (7 cores used)
The changes I made to the DDL probably improved the performance. You can check for yourself whether the problems you are facing are gone by trying this DDL, to better evaluate the OmniSci database:
CREATE TABLE hits_v1 (
WatchID BIGINT,
JavaEnable TINYINT,
Title TEXT ENCODING DICT(32),
GoodEvent SMALLINT,
EventTime TIMESTAMP(0) ENCODING FIXED(32),
EventDate DATE ENCODING DAYS(16),
CounterID INTEGER,
ClientIP TEXT ENCODING DICT(32),
ClientIP6 TEXT ENCODING DICT(32),
RegionID INTEGER,
UserID TEXT ENCODING DICT(32),
CounterClass TINYINT,
OS TINYINT,
UserAgent TINYINT,
URL TEXT ENCODING DICT(32),
Referer TEXT ENCODING DICT(32),
URLDomain TEXT ENCODING DICT(32),
RefererDomain TEXT ENCODING DICT(32),
Refresh TINYINT,
IsRobot TINYINT,
RefererCategories SMALLINT[],
URLCategories SMALLINT[],
URLRegions INTEGER[],
RefererRegions INTEGER[],
ResolutionWidth SMALLINT,
ResolutionHeight SMALLINT,
ResolutionDepth TINYINT,
FlashMajor TINYINT,
FlashMinor TINYINT,
FlashMinor2 TEXT ENCODING DICT(32),
NetMajor TINYINT,
NetMinor TINYINT,
UserAgentMajor SMALLINT,
UserAgentMinor TEXT ENCODING DICT(32),
CookieEnable TINYINT,
JavascriptEnable TINYINT,
IsMobile TINYINT,
MobilePhone TINYINT,
MobilePhoneModel TEXT ENCODING DICT(32),
Params TEXT ENCODING DICT(32),
IPNetworkID INTEGER,
TraficSourceID TINYINT,
SearchEngineID SMALLINT,
SearchPhrase TEXT ENCODING DICT(32),
AdvEngineID TINYINT,
IsArtifical TINYINT,
WindowClientWidth SMALLINT,
WindowClientHeight SMALLINT,
ClientTimeZone SMALLINT,
ClientEventTime TIMESTAMP(0) ENCODING FIXED(32),
SilverlightVersion1 TINYINT,
SilverlightVersion2 TINYINT,
SilverlightVersion3 INTEGER,
SilverlightVersion4 SMALLINT,
PageCharset TEXT ENCODING DICT(32),
CodeVersion INTEGER,
IsLink TINYINT,
IsDownload TINYINT,
IsNotBounce TINYINT,
FUniqID TEXT ENCODING DICT(32),
HID TEXT ENCODING DICT(32),
IsOldCounter TINYINT,
IsEvent TINYINT,
IsParameter TINYINT,
DontCountHits TINYINT,
WithHash TINYINT,
HitColor TEXT ENCODING DICT(32),
UTCEventTime TIMESTAMP(0) ENCODING FIXED(32),
Age TINYINT,
Sex TINYINT,
Income TINYINT,
Interests SMALLINT,
Robotness TINYINT,
GeneralInterests SMALLINT[],
RemoteIP TEXT ENCODING DICT(32),
RemoteIP6 TEXT ENCODING DICT(32),
WindowName INTEGER,
OpenerName INTEGER,
HistoryLength SMALLINT,
BrowserLanguage TEXT ENCODING DICT(32),
BrowserCountry TEXT ENCODING DICT(32),
SocialNetwork TEXT ENCODING DICT(32),
SocialAction TEXT ENCODING DICT(32),
HTTPError SMALLINT,
SendTiming INTEGER,
DNSTiming INTEGER,
ConnectTiming INTEGER,
ResponseStartTiming INTEGER,
ResponseEndTiming INTEGER,
FetchTiming INTEGER,
RedirectTiming INTEGER,
DOMInteractiveTiming INTEGER,
DOMContentLoadedTiming INTEGER,
DOMCompleteTiming INTEGER,
LoadEventStartTiming INTEGER,
LoadEventEndTiming INTEGER,
NSToDOMContentLoadedTiming INTEGER,
FirstPaintTiming INTEGER,
RedirectCount TINYINT,
SocialSourceNetworkID TINYINT,
SocialSourcePage TEXT ENCODING DICT(32),
ParamPrice BIGINT,
ParamOrderID TEXT ENCODING DICT(32),
ParamCurrency TEXT ENCODING DICT(32),
ParamCurrencyID SMALLINT,
GoalsReached INTEGER[],
OpenstatServiceName TEXT ENCODING DICT(32),
OpenstatCampaignID TEXT ENCODING DICT(32),
OpenstatAdID TEXT ENCODING DICT(32),
OpenstatSourceID TEXT ENCODING DICT(32),
UTMSource TEXT ENCODING DICT(32),
UTMMedium TEXT ENCODING DICT(32),
UTMCampaign TEXT ENCODING DICT(32),
UTMContent TEXT ENCODING DICT(32),
UTMTerm TEXT ENCODING DICT(32),
FromTag TEXT ENCODING DICT(32),
HasGCLID TINYINT,
RefererHash TEXT ENCODING DICT(32),
URLHash TEXT ENCODING DICT(32),
CLID INTEGER,
YCLID BIGINT,
ShareService TEXT ENCODING DICT(32),
ShareURL TEXT ENCODING DICT(32),
ShareTitle TEXT ENCODING DICT(32),
Key1 TEXT[],
Key2 TEXT[],
Key3 TEXT[],
Key4 TEXT[],
Key5 TEXT[],
ValueDouble DOUBLE[],
IslandID TEXT ENCODING DICT(32),
RequestNum INTEGER,
RequestTry TINYINT);
I execute the following query on the CPU version of OmniSci:
in multiple runs. The first run completed successfully (in 30 seconds), but the second run did not finish in more than an hour.
Table definition:
How to fill the table:
Download the "extended version of the hits table containing 100 million rows" from here: https://clickhouse.tech/docs/en/getting-started/example-datasets/metrica/
Insert into ClickHouse.
Transform to CSV with the following query:
Insert into OmniSci with the following query: