Closed Codelone closed 3 years ago
Following your Nebula schema and Exchange configuration, I re-ran your process with the same Nebula schema, the same data, the same Exchange configuration, and the same submit command, but I could not reproduce the error message. The result looks normal and correct.
INFO [nebula.exchange.Exchange$$anonfun$main$3:apply:197] SST-Import: failure.address_id: 0
Can you please re-run the process just for the edge address_id?
This edge.log is only for the edge address_id. Actually, the vertex "idno" has the same error as well; you can check the driver.log. Do you want to use my data? It may be too big, though.
We can try your large-scale data. Please upload your data here, thank you very much.
GitHub does not allow the upload; please download it from DingTalk: https://space.dingtalk.com/s/gwHOAxnJcALOEVwinwPaACBjNmFjNGNmYTRkYTg0MjAyYWM5MzFhYThiNTM2ODM4Mw password: aWVs
Thanks for the issue and the test data. We have reproduced the bug. It occurs because an SST file requires keys in strictly ascending order, and your data contains many duplicate keys, which causes the SST file write to fail. For now, you can deduplicate your CSV data by vid (for edges, deduplicate by src & dst). We will fix it soon. Thanks again.
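The deduplication described above can be done in Spark before handing the CSV files to Exchange. A minimal sketch, assuming the column layout from the examples in this thread (first column is the vid for the tag data; first two columns are src and dst for the edge data) and hypothetical file names:

```scala
import org.apache.spark.sql.SparkSession

object DedupBeforeExchange {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dedup-for-sst").getOrCreate()

    // Tag data: keep one row per vid. Headerless CSV columns are named _c0, _c1, ...
    spark.read.option("header", "false").csv("idno.csv")
      .dropDuplicates("_c0") // _c0 is the vid column
      .write.csv("idno_dedup")

    // Edge data: keep one row per (src, dst) pair.
    spark.read.option("header", "false").csv("address_id.csv")
      .dropDuplicates("_c0", "_c1") // _c0 = src, _c1 = dst
      .write.csv("address_id_dedup")

    spark.stop()
  }
}
```

Note that `dropDuplicates` keeps an arbitrary row per key, so if the duplicate rows carry different property values, you may want to pick a specific one (e.g. the latest) before deduplicating.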
So now the key of the data must be unique?
Yes. If it is not urgent for you, you can wait for our fix PR; we are working on it.
https://github.com/vesoft-inc/nebula-spark-utils/pull/150 fixed this issue.
(for edge, distinct it according to src & dst).
I notice that edges are deduplicated by src and dst only. If two edges share src and dst but have different ranks, could that cause an error?
Your edge config in the Exchange config file does not define a rank field, so it's enough to deduplicate edge data by src and dst.
If my data has rank, do I need to ensure that src and dst do not duplicate, or only that the combination of src, dst, and rank is unique? In our real data we will have ranks, and there may be multiple edges with the same src and dst but different ranks.
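For illustration, a hedged sketch of the rank case: the Nebula v2 edge key encodes src, edge type, rank, and dst, so edges that differ only in rank should already have distinct SST keys. Under that assumption, and assuming rank sits in the third CSV column and is declared as the rank field in the Exchange edge config, deduplicating on the full (src, dst, rank) triple would suffice:

```scala
import org.apache.spark.sql.SparkSession

object DedupEdgesWithRank {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dedup-with-rank").getOrCreate()

    // Keep one row per (src, dst, rank); rows that differ in rank are kept,
    // because rank is part of the edge key in the generated SST file.
    spark.read.option("header", "false").csv("address_id.csv")
      .dropDuplicates("_c0", "_c1", "_c2") // _c0 = src, _c1 = dst, _c2 = rank
      .write.csv("address_id_dedup")

    spark.stop()
  }
}
```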
I have already merged my branch (forked from v2.5) with #150, but this problem still exists when Exchange generates edge SST files. However, I avoided the duplicate keys another way, like below:
```scala
class NebulaSSTWriter(path: String) extends Writer {
  ...

  // Remember the last key written so consecutive duplicates can be skipped;
  // RocksDB's SstFileWriter rejects keys that are not strictly ascending.
  private var lastKey: Array[Byte] = Array.emptyByteArray

  def write(key: Array[Byte], value: Array[Byte]): Unit = {
    if (!key.sameElements(lastKey)) {
      writer.put(key, value)
    }
    lastKey = key
  }
}
```
By using this, my problem disappeared, which is weird. I'm using the hash policy when generating keys; could this be caused by a hash collision?
This is the Spark job log: driver.log. Space: sst_test. Tag idno data example:
```
230421197906123305,0,客户0,0,2021/1/1,女,博士,离异,1,black
230421197906123305,1,客户1,1,2021/1/1,女,博士,离异,1,black
230421197906123305,2,客户2,2,2021/1/1,女,博士,离异,1,black
533101196807196696,3,客户3,3,2021/1/1,女,博士,离异,1,black
533101196807196696,4,客户4,4,2021/1/1,女,博士,离异,1,black
220722198306264943,5,客户5,5,2021/1/1,女,博士,离异,1,black
220722198306264943,6,客户6,6,2021/1/1,女,博士,离异,1,black
220722198306264943,7,客户7,7,2021/1/1,女,博士,离异,1,black
310200197802274261,8,客户8,8,2021/1/1,女,博士,离异,1,black
310200197802274261,9,客户9,9,2021/1/1,女,博士,离异,1,black
```
Edge address_id data example:

```
贵州省黔东南苗族侗族自治州榕江县镜湖花园22栋177号,230421197906123305,2019-02-20 08:02:20
贵州省黔东南苗族侗族自治州榕江县镜湖花园22栋177号,230421197906123305,2019-01-30 13:26:01
贵州省黔东南苗族侗族自治州榕江县镜湖花园22栋177号,230421197906123305,2019-03-31 03:19:30
四川省泸州市龙马潭区锋尚名居8栋734号,533101196807196696,2019-05-14 22:31:58
四川省泸州市龙马潭区锋尚名居8栋734号,533101196807196696,2020-02-20 06:09:22
河北省邢台市巨鹿县市聚福园16栋228号,220722198306264943,2020-12-11 12:23:53
河北省邢台市巨鹿县市聚福园16栋228号,220722198306264943,2020-01-08 05:19:52
河北省邢台市巨鹿县市聚福园16栋228号,220722198306264943,2020-04-29 09:23:55
浙江省丽水地区青田县尚书苑8栋271号,310200197802274261,2019-09-03 01:40:32
浙江省丽水地区青田县尚书苑8栋271号,310200197802274261,2019-04-23 18:37:49
```
application-sst.conf
Run command:
./spark-2.4.8-bin-hadoop2.6/bin/spark-submit --master yarn-client --class com.vesoft.nebula.exchange.Exchange nebula-exchange-2.5-SNAPSHOT.jar -c application-sst.conf