qri-io / qri

you're invited to a data party!
https://qri.io
GNU General Public License v3.0
1.1k stars 66 forks source link

bug(cmd save): qri save doesn't detect changes in body #1383

Open Arqu opened 4 years ago

Arqu commented 4 years ago

Environment

OS info

ProductName:    Mac OS X
ProductVersion: 10.15.2
BuildVersion:   19C57

qri version

0.9.9-dev @ 47415190d37e5f25a8b3a3c60950467809ab2745

Expected Behavior

mkdir population_bank_world
cd population_bank_world
qri init
cp ../body.csv ./body.csv
qri save
echo "Zimbabwe,ZWE,\"Population, total\",SP.POP.TOTL,3776681,3905034,4039201,4178726,4322861,4471177,4623351,4779827,4941906,5111337,5289303,5476982,5673911,5877726,6085074,6293875,6502569,6712827,6929664,7160023,7408624,7675591,7958241,8254747,8562249,8877489,9200149,9527203,9849125,10153852,10432421,10680995,10900502,11092766,11261744,11410714,11541217,11653242,11747072,11822719,11881477,11923914,11954290,11982224,12019912,12076699,12155491,12255922,12379549,12526968,12697723,12894316,13115131,13350356,13586681,13814629,14030390,14236745,14439018,," >> ./body.csv
qri save

Expected to produce:

for linked dataset [arqu/population_bank_world]

dataset saved: arqu/population_bank_world@/ipfs/QmVx438tG5FZLM2xLaz9YD57RPSVhDR2fFv8QfaDuGRaok

Actual Behavior

for linked dataset [arqu/population_bank_world]

error saving: no changes

Other info

dustmop commented 4 years ago

Tried to reproduce but the save worked for me. The steps you gave, do they reliably work when run using a brand new user with an empty repo? (I used qriswitch to test those steps)

Other questions, after you've modified the body, right at the point when you run the second qri save, what do qri status and qri diff produce? What about qri log?

Arqu commented 4 years ago
$> qri log
1   Commit:  /ipfs/QmVx438tG5FZLM2xLaz9YD57RPSVhDR2fFv8QfaDuGRaok
    Date:    Wed Jun 10 11:20:46 CEST 2020
    Storage: local
    Size:    110 kB

    created dataset
$> qri status
for linked dataset [koka/population_bank_world]

  modified: body (source: body.csv)

run `qri save` to commit this dataset
$> qri diff
+330 elements. 0 inserts. 0 deletes.

body: [["Afghanistan","AFG","Population, total","SP.POP.TOTL",8996973,9169410,9351441,9543205,9744781,9956320,10174836,10399926,10637063,10893776,11173642,11475445,11791215,12108963,12412950,12689160,12943093,13171306,13341198,13411056,13356511,13171673,12882528,12537730,12204292,11938208,11736179,11604534,11618005,11868877,12412308,13299017,14485546,15816603,17075727,18110657,18853437,19357126,19737765,20170844,20779953,21606988,22600770,23680871,24726684,25654277,26433049,27100536,27722276,28394813,29185507,30117413,31161376,32269589,33370794,34413603,35383128,36296400,37172386,"",""],["Angola","AGO","Population, total","SP.POP.TOTL",5454933...
$> qri save
for linked dataset [koka/population_bank_world]

error saving: no changes

User is fresh, IPFS might be dirty. Will try to do another clean run. So far can reliably reproduce every time.

Edit: NVM, seems like IPFS is the issue. Was plenty dirty. Got the old repo backed up if you want to play but I'm happy to close this as I've fooled around plenty with it before this happened.

dustmop commented 4 years ago

Sure, send me along the backed up repo. Though it'll probably be lower priority if it's hard to reproduce. How were you using a fresh user but an existing ipfs repo?

I'm curious what causes this sort of thing to happen, and if there's a better way we can handle the error, instead of silently failing. Do you know what caused your ipfs repo to get into the "dirty" state that you believe was the root cause?

Arqu commented 4 years ago

I was manually re-doing my user but kept the .ipfs folder around. I'd say the probably biggest culprit is tons of the same user settings recreated while testing a local registry during development.

Agree it would be good to know how it can detect a diff but on save return there's no change. Other than that I'd just leave it on the back burner. I'll keep the backup around if this ever crops up and we can try to get to the bottom of it.

Arqu commented 4 years ago

So I'm re-opening this as I've managed to encounter this with an empty repo. Here's a branch that logs out the following (--log-all) https://github.com/qri-io/qri/commit/217b8dbd99b8a84c8c09c59124468d737f26bcba

$> qri save
for linked dataset [hepek/wbp]

⠋ DiffDescriptions 0
DiffDescriptions 1
buildComponentChanges HD:9, BD:3
buildComponentChanges type:replace
buildComponentChanges initCompChanges
buildComponentChanges type:+
buildComponentChanges initCompChanges
buildComponentChanges type:+
buildComponentChanges initCompChanges
buildComponentChanges assumeBody NOT Changed
buildBodyChanges parentPath: ''
buildBodyChanges delta: ' '
buildBodyChanges delta: ' '
buildBodyChanges delta: ' '
DiffDescriptions comp:meta
DiffDescriptions comp:structure
DiffDescriptions comp:readme
DiffDescriptions comp:viz
DiffDescriptions comp:transform
DiffDescriptions comp:body
DiffDescriptions 2
DiffDescriptions 3
error saving: no changes

$> qri save -m test produces the same result

Attached are the repo, config and dataset. bugdrop.zip

Notable differences in setup:

dustmop commented 4 years ago

Still unable to reproduce this. But looks at the logging output you added I'm noticing the following:

buildComponentChanges HD:9, BD:3
...
buildBodyChanges parentPath: ''
buildBodyChanges delta: ' '
buildBodyChanges delta: ' '
buildBodyChanges delta: ' '

This means that the bodyDeltas has 3 deltas in it, which is what gets returned from deepdiff. Iterating over those deltas, your code is printing the delta.Type, which is the ' '. This means, of course, DTContext https://github.com/qri-io/deepdiff/blob/master/delta.go#L13, which is represented by a space character and is used when nothing is different between the first and second values passed to deepdiff. So for whatever reason, the bodys that are handed off to deepdiff result in a diff with no changes.


One other thing, just to prevent following any incorrect leads, this part of the log:

buildComponentChanges assumeBody NOT Changed

should actually be stated as:

buildComponentChanges NOT assumeBodyChanged

in other words, we shouldn't assume the body has changed, because we should to inspect the result of deepdiff to determine that fact. The assumeBodyChanged is only true when the body is too large to diff, which is not the case here.

Arqu commented 4 years ago

So got another update on this. I can reliably reporduce this from a fresh repo. If the initial body is:

one,two,3
four,five,6

And I update it to:

one,two,3
four,five,6
four,five,6

It doesn't detect the changes, however if I instead update it to:

one,two,3
four,five,6
four,five,7

It works and saves correctly.

Now I'm not entirely sure to why it is happening (especially since we accurately detect the presence of a change on qri status), however my hunch would be that for status we only have a lightweight check like size/hash equality while the save path includes the differ to infer all the subesquent changes.

Now the differ might fail on the above if the rows are the same as it doesn't have enough context bellow/above and figures it is "the same row" and thus doesn't detect a "real" change in the contents. If you change the value, it obviously detects the row diff and then works as expected.

@dustmop @b5 any thoughts on this?

Arqu commented 4 years ago

Just to add newer info, seems like if I do qri save any number of times it fails. However if I do qri status and then qri save it picks up.

Arqu commented 4 years ago

Think this is another instance of this: https://github.com/qri-io/qri/issues/1505

dustmop commented 4 years ago

Would deeply enjoy fixing this bug, but likely it will require a rewrite, or at least a large overhaul of deepdiff. Part of the problem is that deepdiff contains non-determinism: https://github.com/qri-io/deepdiff/pull/7, but even aside from that, it returns incorrect results in certain known cases.

The save path uses deepdiff for 2 purposes: 1) determining whether something in the body has changed, and 2) deriving a textual message that describes what the change is. As long as deepdiff has these bugs, the save operation is going to behave incorrectly.

ramfox commented 3 years ago

hey just adding my own notation of seeing this bug, but running qri status before running qri save did not allow me to save

previous body:

one,two,3
four,five,6
seven,eight,9
ten,eleven,12

edited body:

one,two,3
four,five,6
seven,eight,9
ten,eleven,12
one,two,3
four,five,6
seven,eight,9
ten,eleven,12
one,two,3
four,five,6
seven,eight,9
ten,eleven,12
one,two,3
four,five,6
seven,eight,9
ten,eleven,12

output:

$ qri save --log-all
2020-10-06T20:06:45.408-0400    DEBUG   lib lib/lib.go:382  --log-all set: turning on logging for all activity
2020-10-06T20:06:45.447-0400    DEBUG   cmd cmd/qri.go:164  running cmd ["qri" "save" "--log-all"]
for linked dataset [ramfox_head/body]

2020-10-06T20:06:45.447-0400    DEBUG   base    base/dataset_prepare.go:38  PrepareSaveRef refStr="ramfox_head/body" bodyPathNameHint="" wantNewName=false
⠋ 2020-10-06T20:06:45.447-0400  DEBUG   base    base/dataset_prepare.go:101 PrepareSaveRef found previous initID="dkbbaysyxlxuxz3vq7ztc7r4o4h26gwqibrhiad5hq5svkc6ts4a"
2020-10-06T20:06:45.448-0400    DEBUG   base    base/save.go:25 SaveDataset initID="dkbbaysyxlxuxz3vq7ztc7r4o4h26gwqibrhiad5hq5svkc6ts4a" prevPath="/ipfs/QmZSWMCbjmFxxhWYC2pX1AGSiZunPS8g72xFF92yPBX4aU"
2020-10-06T20:06:45.448-0400    DEBUG   dsfs    dsfs/dataset.go:51  LoadDataset path="/ipfs/QmZSWMCbjmFxxhWYC2pX1AGSiZunPS8g72xFF92yPBX4aU"
2020-10-06T20:06:45.448-0400    DEBUG   dsfs    dsfs/dataset.go:77  LoadDatasetRefs path="/ipfs/QmZSWMCbjmFxxhWYC2pX1AGSiZunPS8g72xFF92yPBX4aU"
2020-10-06T20:06:45.448-0400    DEBUG   dsfs    dsfs/dataset.go:111 DerefDataset path="/ipfs/QmZSWMCbjmFxxhWYC2pX1AGSiZunPS8g72xFF92yPBX4aU"
2020-10-06T20:06:45.448-0400    DEBUG   dsfs    dsfs/dataset.go:51  LoadDataset path="/ipfs/QmZSWMCbjmFxxhWYC2pX1AGSiZunPS8g72xFF92yPBX4aU"
2020-10-06T20:06:45.448-0400    DEBUG   dsfs    dsfs/dataset.go:77  LoadDatasetRefs path="/ipfs/QmZSWMCbjmFxxhWYC2pX1AGSiZunPS8g72xFF92yPBX4aU"
2020-10-06T20:06:45.448-0400    DEBUG   dsfs    dsfs/dataset.go:111 DerefDataset path="/ipfs/QmZSWMCbjmFxxhWYC2pX1AGSiZunPS8g72xFF92yPBX4aU"
2020-10-06T20:06:45.449-0400    DEBUG   base    base/save.go:121    CreateDataset ds=&dataset.Dataset{bodyFile:(*localfs.LocalFile)(0xc00048dda0), Body:interface {}(nil), BodyBytes:[]uint8(nil), BodyPath:"/Users/ramfox/Work/datasets/body/body.csv", Commit:(*dataset.Commit)(0xc0005da230), Meta:(*dataset.Meta)(nil), Name:"body", Path:"/ipfs/QmZSWMCbjmFxxhWYC2pX1AGSiZunPS8g72xFF92yPBX4aU", Peername:"ramfox_head", PreviousPath:"/ipfs/QmZSWMCbjmFxxhWYC2pX1AGSiZunPS8g72xFF92yPBX4aU", ProfileID:"", Readme:(*dataset.Readme)(nil), NumVersions:0, Qri:"ds:0", Structure:(*dataset.Structure)(0xc000544320), Transform:(*dataset.Transform)(nil), Viz:(*dataset.Viz)(nil)} dsPrev=&dataset.Dataset{bodyFile:qipfs.ipfsFile{path:"/ipfs/Qmb8zEnXWaxNe83majrb9jmjUmAeat9do76AC1c1p8cb4i", r:(*unixfile.ufsFile)(0xc000284b50)}, Body:interface {}(nil), BodyBytes:[]uint8(nil), BodyPath:"/ipfs/Qmb8zEnXWaxNe83majrb9jmjUmAeat9do76AC1c1p8cb4i", Commit:(*dataset.Commit)(0xc0005dab60), Meta:(*dataset.Meta)(nil), Name:"", Path:"/ipfs/QmZSWMCbjmFxxhWYC2pX1AGSiZunPS8g72xFF92yPBX4aU", Peername:"", PreviousPath:"", ProfileID:"", Readme:(*dataset.Readme)(nil), NumVersions:0, Qri:"ds:0", Structure:(*dataset.Structure)(0xc000544140), Transform:(*dataset.Transform)(nil), Viz:(*dataset.Viz)(nil)}
2020-10-06T20:06:45.449-0400    DEBUG   dsfs    dsfs/dataset.go:254 CreateDataset
2020-10-06T20:06:45.449-0400    DEBUG   dsfs    dsfs/dataset.go:111 DerefDataset path="/ipfs/QmZSWMCbjmFxxhWYC2pX1AGSiZunPS8g72xFF92yPBX4aU"
2020-10-06T20:06:45.449-0400    DEBUG   dsfs    dsfs/dataset.go:111 DerefDataset path="/ipfs/QmZSWMCbjmFxxhWYC2pX1AGSiZunPS8g72xFF92yPBX4aU"
2020-10-06T20:06:45.449-0400    DEBUG   dsfs    dsfs/dataset.go:313 prepareDataset
2020-10-06T20:06:45.449-0400    DEBUG   dsfs    dsfs/dataset.go:520 set depth and entry count. depth=2 entries=16
2020-10-06T20:06:45.449-0400    DEBUG   dsfs    dsfs/dataset.go:567 set length and checksum. length=200 checksum="QmXieumd8M2v4zH6Egw5YmP9xUG9337SzXKbGtnhR5UTQx"
2020-10-06T20:06:45.450-0400    DEBUG   dsfs    dsfs/dataset.go:481 setErrorCount result ErrCount=0
2020-10-06T20:06:45.450-0400    DEBUG   dsfs    dsfs/dataset.go:578 generateCommitDescriptions
2020-10-06T20:06:45.450-0400    DEBUG   dsfs    dsfs/dataset.go:587 inlining body file to calulate a diff
2020-10-06T20:06:45.450-0400    DEBUG   dsfs    dsfs/dataset.go:773 calculating body statDiff type(prevBody)=[]interface {} type(nextBody)=[]interface {}
2020-10-06T20:06:45.451-0400    DEBUG   dsfs    dsfs/dataset.go:781 setting diff descriptions
2020-10-06T20:06:45.451-0400    DEBUG   friendly    friendly/friendly.go:26DiffDescriptions len(headDeltas)=9 len(bodyDeltas)=16 bodyStats=&{17 65 123 489 0 0 0} assumeBodyChanged=false
2020-10-06T20:06:45.451-0400    DEBUG   dsfs    dsfs/dataset.go:424 error saving: no changes
2020-10-06T20:06:45.451-0400    DEBUG   dsfs    dsfs/dataset.go:279 error saving: no changes
2020-10-06T20:06:45.451-0400    DEBUG   base    base/save.go:155    dsfs.CreateDataset: error saving: no changes
2020-10-06T20:06:45.451-0400    DEBUG   lib lib/datasets.go:652 create ds error: error saving: no changes