vesoft-inc / nebula-importer

Nebula Graph Importer with Go
Apache License 2.0

Importing VertexID with UTF8 characters in CSV file #257

Closed goranc closed 1 year ago

goranc commented 1 year ago

When importing data with escaped UTF-8 characters in a string-type VertexID, the importer does not convert the escape sequences to UTF-8 characters; it inserts the escaped characters exactly as they appear in the CSV file.

My Nebula cluster is running version 3.3.0. The VertexID is a fixed string type with a length of 28 characters. I'm using a custom algorithm for the VertexID to avoid collisions: it is an 8-character lexicographic prefix taken from a string, concatenated with that string's hash (the standard Nebula hash function) converted to a string.

Steps to reproduce the behavior:

1. Create a space whose VertexID is defined as a fixed string of 28 characters.
2. Create a TAG for the URL vertex.
3. Import URL data containing UTF-8 characters.

Examples:

CREATE SPACE IF NOT EXISTS graph(partition_num=128, replica_factor=3, vid_type=fixed_string(28));

USE graph;

CREATE TAG url(link string, subdomain_name string, domain_name string, protocol string, classification string);

Try to import data into the TAG:

"stubhub\xe6-2541048767624938324": ("http://stubhub手数料3.xyz","stubhub手数料3.xyz","stubhub手数料3.xyz","http",""),
"download1336853390718461484": ("http://downloads.sourceforge.net/project/orz123/a23.mp3?r=&ts=1448325706&use_mirror=heanet","downloads.sourceforge.net","sourceforge.net","http",""),
"oss.jfro1186231920510779202": ("https://oss.jfrog.org/artifactory/jcenter-remote/com/google/apis/google-api-services-cloudkms/v1-rev20-1.21.0/google-api-services-cloudkms-v1-rev20-1.21.0.jar","oss.jfrog.org","jfrog.org","https","");

ErrMsg: Storage Error: The VID must be a 64-bit integer or a string fitting space vertex id length limit., ErrCode: -1005

We expect the VertexID field to contain the actual characters, as the domain field does, but instead the escape sequence is not converted and we get an error saying the VertexID exceeds the length limit.

wey-gu commented 1 year ago

cc @veezhang

veezhang commented 1 year ago

@goranc Hi, can you paste your csv data here?

goranc commented 1 year ago

OK, I've spent a little more time on this issue.

What has changed since the previous test is that I now make sure the strings contain complete UTF-8 characters, so we avoid escape sequences that start with hexadecimal escape codes (like \xe6 in the previous example). Supporting those could be a completely separate feature.

So let's concentrate on importing regular UTF-8 strings as VertexIDs. The issue is that the VertexID is limited to a fixed string size, and UTF-8 characters can take more than one byte each, as is the case with Chinese, Russian, Japanese and other character sets. Chinese and Japanese characters, for example, take 3 bytes each and cause the VertexID to overflow its size, which shouldn't happen if the VertexID is defined as 28 characters long.
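To make the character-versus-byte distinction concrete, here is a minimal Python sketch. The helper name `utf8_len` is made up for illustration and is not part of Nebula or nebula-importer; it only shows how many bytes a string occupies once encoded as UTF-8.

```python
def utf8_len(text: str) -> int:
    """Number of bytes `text` occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


# The width of a single character depends on the script.
for ch in ("a", "я", "中", "手"):
    print(f"{ch!r}: {utf8_len(ch)} byte(s)")

# The 8-character prefix of the failing domain name "中醫中藥cn.top".
prefix = "中醫中藥cn.t"
print(len(prefix), utf8_len(prefix))  # 8 characters, 16 bytes
```

An 8-character prefix can therefore consume well over 8 bytes of a byte-limited VertexID.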

You can try importing the domain data I got errors for and see that size is the problem here.

The tag definition for these records is:

CREATE SPACE IF NOT EXISTS graph(partition_num=128, replica_factor=3, vid_type=fixed_string(28));
USE graph;
CREATE TAG domain(name string, classification string, active bool);

An INSERT command that fails looks like this:

[ERROR] handler.go:63: Client 8 fail to execute: 

INSERT VERTEX `domain`(`name`,`classification`,`active`) VALUES  
"neuroeco-2725713350576147783": ("neuroeconomia.com.br","",true), "majortoo-1093788498676804281": ("majortool.website","",true), 
"f1024pro7198050941800472293": ("f1024proku.cn","",true), "pixers.p-6753124849544050968": ("pixers.pl","",true), 
"iojet.co9111344562419197580": ("iojet.com","",true), "christba-5296937626571511539": ("christbaumservice.de","",true), 
"badgerfa5077377065993132103": ("badgerfarms.com","",true), "adventpo6252634886331700143": ("adventpowerprotection.com","",true), 
"davidsto5677837830947720780": ("davidstout.net","",true), "nicolasp-2162137739921036202": ("nicolaspoggi.com","",true), 
"intercon-6026190726284770122": ("intercontb.com","",true), "exclusiv-7590762108140403075": ("exclusiveagencyofficial.com","",true), 
"kawn.inf3823846560429494917": ("kawn.info","",true), "cengocen-5500962025655561744": ("cengocengo.github.io","",true), 
"中醫中藥cn.t4508929864515433325": ("中醫中藥cn.top","",true), "zomerter-2623832337387839851": ("zomerterras50bar.com","",true);

, ErrMsg: Storage Error: The VID must be a 64-bit integer or a string fitting space vertex id length limit., ErrCode: -1005

goranc commented 1 year ago

Just to explain this hybrid VertexID structure.

It is a combination of a lexicographic prefix and a hash function, which we use to avoid collisions in the graph space.

The VertexID is generated from the TAG property domain.name as: substring(domain.name,1,8) + toString(hash(domain.name))

Note: I think this is an issue with the Nebula VertexID and its internal definition of string size, not only with importing data from CSV files.
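One possible workaround, sketched here in Python as an assumption rather than anything the thread adopted, is to limit the prefix by bytes instead of by characters so that the prefix plus hash always fits the fixed-string budget. The helpers `byte_prefix` and `build_vid` are hypothetical, and the hash is passed in as a precomputed integer; Nebula's hash function is not reimplemented here.

```python
def byte_prefix(text: str, max_bytes: int) -> str:
    """Longest prefix of `text` whose UTF-8 encoding fits in `max_bytes`.

    Stops at the first character that would overflow, so a multibyte
    character is never split.
    """
    used = 0
    chars = []
    for ch in text:
        width = len(ch.encode("utf-8"))
        if used + width > max_bytes:
            break
        chars.append(ch)
        used += width
    return "".join(chars)


def build_vid(name: str, name_hash: int, prefix_bytes: int = 8) -> str:
    """Hypothetical variant of the thread's scheme: byte-limited prefix + hash.

    `name_hash` is assumed to be computed elsewhere (for example with
    Nebula's hash()); it is only converted to a string here.
    """
    return byte_prefix(name, prefix_bytes) + str(name_hash)


vid = build_vid("中醫中藥cn.top", 4508929864515433325)
print(vid, len(vid.encode("utf-8")))
```

For ASCII-only names this produces the same VertexID as the character-based scheme; for multibyte names the prefix gets shorter, so any VertexIDs already stored for such names would differ. Defining a larger fixed_string size is the other obvious option.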

wey-gu commented 1 year ago

Dear @goranc

Sorry, @whitewum was not aware that you cannot read Chinese. We have this screen capture in the Chinese documentation mentioning that one Chinese UTF-8 character takes 3 bytes (maybe not what you expected when calculating its length?).

As the following:

(root@nebula) [nba]> show create space nba
+-------+-------------------------------------------------------------------------------------------------------------------------------+
| Space | Create Space                                                                                                                  |
+-------+-------------------------------------------------------------------------------------------------------------------------------+
| "nba" | "CREATE SPACE `nba` (partition_num = 7, replica_factor = 1, charset = utf8, collate = utf8_bin, vid_type = FIXED_STRING(32))" |
+-------+-------------------------------------------------------------------------------------------------------------------------------+
Got 1 rows (time spent 989/28945 us)

# 11 chinese utf8 chars
(root@nebula) [nba]> insert vertex player(name,age) values "中中中中中中中中中中中":('length_11_chinese_utf8', 42);
[ERROR (-1005)]: Storage Error: The VID must be a 64-bit integer or a string fitting space vertex id length limit.

Thu, 12 Jan 2023 09:55:35 CST

# 10 chinese utf8 chars + 2 ascii chars
(root@nebula) [nba]> insert vertex player(name,age) values "中中中中中中中中中中01":('length_10_and_2_chinese_utf8', 42);
Execution succeeded (time spent 1257/25011 us)

Thu, 12 Jan 2023 09:55:51 CST

In this case the length of "中醫中藥cn.t4508929864515433325" is actually 35 bytes:

In [7]: 3 * len("中醫中藥") + len("cn.t4508929864515433325")
Out[7]: 35

We will add this information to the English documentation later; sorry about this.
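The same arithmetic can be run over a whole CSV file before handing it to the importer. The following is a minimal sketch, assuming the VertexID sits in the first column and the space uses fixed_string(28); the function name `oversized_vids` and the sample rows' column layout are illustrative, not something nebula-importer provides.

```python
import csv
import io

VID_LIMIT = 28  # bytes, matching vid_type=fixed_string(28) in this thread


def oversized_vids(rows, vid_column=0, limit=VID_LIMIT):
    """Yield (row_number, vid, byte_length) for VIDs longer than `limit` bytes."""
    for row_number, row in enumerate(rows, start=1):
        if len(row) <= vid_column:
            continue  # skip blank or short rows
        vid = row[vid_column]
        size = len(vid.encode("utf-8"))
        if size > limit:
            yield row_number, vid, size


# Inline sample standing in for a real file; the column layout is assumed.
sample = io.StringIO(
    "iojet.co9111344562419197580,iojet.com,,true\n"
    "中醫中藥cn.t4508929864515433325,中醫中藥cn.top,,true\n"
)
problems = list(oversized_vids(csv.reader(sample)))
for row_number, vid, size in problems:
    print(f"row {row_number}: {vid!r} is {size} bytes (limit {VID_LIMIT})")
```

For a real file, the `io.StringIO` sample would be replaced by `open(path, newline="", encoding="utf-8")`.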

goranc commented 1 year ago

OK, it is clear now what is going on behind the scenes. We just need to be aware of it, and it would be good to include this in the documentation with an explanation like the one in this discussion, with examples that include specific multibyte characters.

wey-gu commented 1 year ago

Thanks @goranc, do you think this doc patch is enough?

https://github.com/vesoft-inc/nebula-docs/pull/1871/files

wey-gu commented 1 year ago

Closing it, thanks @goranc!