Bugfix; incorrect length was used to check for discarded areas

pontus commented 10 months ago

This fixes a bug that could lead to incorrect data being returned in some cases.

When doing reads, we determine the largest amount of data we can read in a single swoop to check that area for holes caused by any edit lists. This is the lowest number of the length of the slice we should store data in and the readable data left in the buffer.

The computation incorrectly used the number of bytes that had been read rather than bytes that can be read, leading to the incorrect part of the logical decrypted stream being checked for holes.

Checking an area that is too large is not a problem, but if the area to check is too small, it could lead to incorrect data being returned if the smaller bit does not contain holes but the correctly calculated (based on the amount of bytes that can be read from the decrypted buffer) bit does.

As an added precaution, this PR also makes sure to not try to read more than the calculated bit.

codecov-commenter commented 10 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (7c6f163) 66.24% compared to head (8ec0094) 66.29%.

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #103 +/- ## ========================================== + Coverage 66.24% 66.29% +0.05% ========================================== Files 6 6 Lines 1176 1178 +2 ========================================== + Hits 779 781 +2 Misses 278 278 Partials 119 119 ``` | [Flag](https://app.codecov.io/gh/neicnordic/crypt4gh/pull/103/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=neicnordic) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/neicnordic/crypt4gh/pull/103/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=neicnordic) | `66.29% <100.00%> (+0.05%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=neicnordic#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

teemukataja commented 10 months ago

Our issue still seems to persist: attempting to get a larger chunk returns less data than requested (generated a 100MB test file for this).

    dataEditListHeaderPacket := headers.DataEditListHeaderPacket{
        PacketType:    headers.PacketType{PacketType: headers.DataEditList},
        NumberLengths: 2,
        Lengths:       []uint64{10000000, 30000000},
    }

This is expected to return 20_000_000 bytes, but it returns 616832.

Running on v1.7.1 returns the 20MB chunk as expected.

pontus commented 10 months ago

30M I assume here? Is this exact code that gives that result or is that something you can share? Similarly, what's the exact size in bytes of the file here (can it be shared easily)?

Is this with a separate testprogram or through sda-download?

teemukataja commented 10 months ago

The code is the unit test in this PR, but instead of sample.txt I used fallocate -l 100M 100mb.bin which is of size 100_000_000.

This is how we used the data edit list to decrypt parts of the file between specific coordinates, and it worked like this up to version 1.7.1:

startCoordinate=10000000&endCoordinate=30000000

NumberLengths: 2
Lengths: []uint64{10000000, 30000000}

provided us with 20000000 bytes between those locations (from X to Y).

Are you saying the data edit list in and before that version was not working correctly? How would it then be used to read a section between X and Y bytes? And how does it work with your example with 3 Lengths Lengths: []uint64{0, 100, 300}?

pontus commented 10 months ago

Yes, that sounds incorrect. Data edit list handling is described in 4.2 in https://samtools.github.io/hts-specs/crypt4gh.pdf.

My "simple" interpretation of that is alternating values of how much to discard and how much to keep, with a missing final keep being interpreted as copy the rest "until the end of file"

Anyway, those values being counters rather than file offsets means the data edit list as above would discard the first 10000000 bytes, provide the next 30000000 and then nothing more, so you should get what is between file offset 10000000 and 40000000, yielding 30000000 bytes.

But looking at https://github.com/neicnordic/sensitive-data-archive/blob/main/sda-download/api/sda/sda.go#L262-L264, it seems this was done correctly in sda-download.

I'll try experimenting a bit running with running the unit tests with a larger file size for the sample file (right now I'm seeing a failure because of a non-failure, but that being in the writer hasn't been touched for a long time).

For extra clarity: odd number of lengths in a data edit list means that the rest of the file will be provided after the last discard, so [0,100,300] means skip 0 bytes (so copy from start), transfer 100 bytes, skip 300 and copy the rest.

pontus commented 10 months ago

So, one of the tests had bad assumptions and has been fixed. There's also a test that (for better or worse) depends on the content being what it is in the repo and will fail with an empty file.

With the fixes in this branch, the tests work fine after extending test/sample.txt to half a gigabyte.

teemukataja commented 10 months ago

Ok, it makes sense now, thank you! :sweat_smile:

blankdots commented 10 months ago

I think we need to revert this, normal decryption does not work at all

steps to reproduce:

go run main.go encrypt -f=sample.txt -p=key.pub.pem

go run main.go decrypt -f=sample.txt.c4gh -s=key.sec.pem
#ends up in a loop

teemukataja commented 10 months ago

I think we need to revert this, normal decryption does not work at all

steps to reproduce:
go run main.go encrypt -f=sample.txt -p=key.pub.pem

go run main.go decrypt -f=sample.txt.c4gh -s=key.sec.pem
#ends up in a loop

confirmed, decryption hangs without producing anything

pontus commented 10 months ago

Oops. Yes, I guess it's best to revert for now (I don't think I can look today).

I think this fix is good and needed but something else might also be required.

(Although I'm a bit surprised I didn't see this while testing.)

neicnordic / crypt4gh

Bugfix; incorrect length was used to check for discarded areas #103

Codecov Report