ndmitchell / shake

Shake build system
http://shakebuild.com

Make "Reading from ByteString, insufficient left" more helpful? #781

Open omnibs opened 3 years ago

omnibs commented 3 years ago

We run into this error with some frequency, and so far believed it to be:

  1. Write a thing to the cache (like an oracle).
  2. Change the thing's format (like the oracle's data type).
  3. Run shake again, and the mismatch throws this error.

But we ran into one scenario today where we gave up. We added versioned to just about everything we could think of, and ended up nuking the cache.

I'm not sure whether the steps I outlined are indeed when this happens, or whether they are only some of the possible scenarios, but we could use some help from Shake itself in figuring out what to do when we hit this error.
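
To illustrate the kind of help we have in mind, here is a minimal sketch of a hand-written Binary instance that stores a version tag in front of the payload, so that a format mismatch fails with a descriptive message instead of a generic "insufficient left" error. This is an assumption about what could be done in user code, not something Shake does; `CachedResult`, `formatVersion` and `decodeCached` are hypothetical names.

```haskell
import Data.Binary (Binary (..), Get, decodeOrFail, encode)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Hypothetical cached value; not a type from Shake or from our code base.
data CachedResult = CachedResult FilePath Bool
  deriving (Eq, Show)

-- Hypothetical tag; bump it whenever the fields of CachedResult change.
formatVersion :: Word8
formatVersion = 2

instance Binary CachedResult where
  put (CachedResult path ok) = do
    put formatVersion -- the version tag goes first
    put path
    put ok
  get = do
    stored <- get :: Get Word8
    if stored /= formatVersion
      then fail ("CachedResult: stored format version " ++ show stored
                   ++ ", expected " ++ show formatVersion)
      else CachedResult <$> get <*> get

-- Decode without throwing; a mismatch is reported as a Left.
decodeCached :: BL.ByteString -> Either String CachedResult
decodeCached bytes = case decodeOrFail bytes of
  Left (_, _, msg) -> Left msg
  Right (_, _, value) -> Right value

main :: IO ()
main = do
  let current = encode (CachedResult "src/Main.hs" True)
      -- Simulate stale bytes by overwriting the tag with an older version.
      stale = BL.cons 1 (BL.drop 1 current)
  print (decodeCached current)
  print (decodeCached stale)
```

The trade-off is that every cached type needs a hand-written instance rather than a derived one, which is more work than bumping `versioned` on the rule.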

ndmitchell commented 3 years ago

Shake is meant to detect that the database is corrupted, wipe it, and rebuild from scratch automatically. Do you ever see that behaviour? Can you give an example of the kind of thing you are writing to the cache? e.g. is it a record? How did you obtain a Binary instance for it? Are you removing/adding fields?
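
As a rough illustration of that detect-and-reset behaviour, here is a generic sketch using the binary package. It is not Shake's actual code; `Cache` and `loadCacheOrReset` are hypothetical names, and the cache is assumed to be a plain map for the sake of the example.

```haskell
import Data.Binary (decodeOrFail, encode)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Map.Strict as Map

-- Hypothetical cache layout: a map from file name to some stored value.
type Cache = Map.Map String Int

-- Decode a cache. On any failure, return an empty cache together with
-- the reason, so the caller can log it and rebuild from scratch.
loadCacheOrReset :: BL.ByteString -> (Cache, Maybe String)
loadCacheOrReset bytes = case decodeOrFail bytes of
  Right (rest, _, cache)
    | BL.null rest -> (cache, Nothing)
    | otherwise -> (Map.empty, Just "trailing bytes after cache")
  Left (_, offset, msg) ->
    (Map.empty, Just (msg ++ " at byte offset " ++ show offset))

main :: IO ()
main = do
  let good = encode (Map.fromList [("a.o", 1), ("b.o", 2)] :: Cache)
      -- Drop the last few bytes to simulate a truncated file.
      truncated = BL.take (BL.length good - 3) good
  print (loadCacheOrReset good)
  print (loadCacheOrReset truncated)
```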

omnibs commented 3 years ago

> Do you ever see that behaviour?

I don't think we ever saw it. Is it a new-ish feature? We're pinned to this revision https://github.com/ndmitchell/shake/tree/6297471621582d87d1e983dd9e04ed02c62beb8f

Does shake output anything different when it detects the database is corrupted and decides to wipe it?

> Can you give an example of the kind of thing you are writing to the cache? e.g. is it a record? How did you obtain a Binary instance for it? Are you removing/adding fields?

I'm having a hard time reproducing this again, but I think it was related to one of our Oracles. We have a few of them returning everything from records to union types, and they get their Binary instances from instance Binary [type] declarations, like this one:

data TestFileFormatted = TestFileFormatted FormatterType FilePath
  deriving (Show, Eq, Generic)

instance Hashable TestFileFormatted

instance Binary TestFileFormatted

instance NFData TestFileFormatted

type instance RuleResult TestFileFormatted = IsFormatted

testFileFormattedOracle :: TestFileFormatted -> Action IsFormatted
testFileFormattedOracle (TestFileFormatted formatterType file) = do
  -- Take a dependency on the version of the formatter, so we rerun this rule
  -- when it changes.
  void $ askOracle (FormatterVersion formatterType)
  need [file]
  let formatter = formatterFor formatterType
  check formatter file

data IsFormatted
  = Formatted
  | Unformatted String
  deriving (Eq, Generic, Show)

instance Hashable IsFormatted

instance Binary IsFormatted

instance NFData IsFormatted

When I run into this again I'll try to collect more info and share here. Any pointers on what data I should collect are appreciated!

omnibs commented 3 years ago

I've run into another incarnation of this problem, where Shake doesn't detect that the database is corrupted and we get runtime errors decoding Binary instances. I'm not sure how related the two are, but this is the stack trace of the second kind of error I get:

  at apply1, called at src/Development/Shake/Internal/Rules/Oracle.hs:159:32 in shake-0.19-Kan5k6lRCGEH1JkmyaVtkS:Development.Shake.Internal.Rules.Oracle
* Depends on: OracleQ (AllServices ())
  at error, called at libraries/binary/src/Data/Binary/Get.hs:351:5 in binary-0.8.6.0:Data.Binary.Get
* Raised the exception:
Data.Binary.Get.runGet at position 2316: not enough bytes
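
For context, the message in that trace is what the pure `decode` function from the binary package raises (via `error`) when it runs out of input. The sketch below reproduces the same kind of failure outside Shake and contrasts it with `decodeOrFail`, which returns the failure as a value. `decodeThrowing` and `decodeChecked` are hypothetical helper names, and `Word64` is only a stand-in payload.

```haskell
import Control.Exception (ErrorCall (..), evaluate, try)
import Data.Binary (decode, decodeOrFail, encode)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word64)

-- Pure decode: a failure surfaces as an ErrorCall once the result is forced.
decodeThrowing :: BL.ByteString -> IO (Either String Word64)
decodeThrowing bytes = do
  result <- try (evaluate (decode bytes))
  pure $ case result of
    Left (ErrorCall msg) -> Left msg
    Right value -> Right value

-- decodeOrFail: the failure message and byte offset come back in a Left.
decodeChecked :: BL.ByteString -> Either String Word64
decodeChecked bytes = case decodeOrFail bytes of
  Left (_, offset, msg) -> Left ("offset " ++ show offset ++ ": " ++ msg)
  Right (_, _, value) -> Right value

main :: IO ()
main = do
  let full = encode (42 :: Word64)
      short = BL.take 5 full -- too few bytes for a Word64
  decodeThrowing full >>= print
  decodeThrowing short >>= print
  print (decodeChecked short)
```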

This is the relevant code:

rules :: Rules ()
rules =
  versioned 6 $ void $ addOracleCache allServicesOracle

newtype AllServices = AllServices ()
  deriving (Show, Typeable, Eq, Hashable, Binary, NFData)

type instance RuleResult AllServices = [Service]

allServicesOracle :: AllServices -> Action [Service]

data Service
  = Service
      { serviceName :: String,
        builders :: Builders,
        codedeploy :: [Codedeploy],
        kubernetes :: Maybe Kubernetes,
        localPort :: Maybe Dhall.Natural
      }
  deriving (Generic, Show, Typeable, Eq)

instance Binary Service

instance Hashable Service

instance NFData Service

instance Dhall.Interpret Service

data Builders
  = Builders
      { elm :: Maybe Elm,
        haskell :: Maybe Haskell,
        ruby :: Bool
      }
  deriving (Generic, Show, Typeable, Eq)

instance Binary Builders

instance Hashable Builders

instance NFData Builders

instance Dhall.Interpret Builders

This was the relevant change:


data Builders
  = Builders
      { elm :: Maybe Elm,
-        haskell :: Maybe Haskell
+        haskell :: Maybe Haskell,
+        ruby :: Bool
      }
  deriving (Generic, Show, Typeable, Eq)

We always thought this was expected behavior and instructed folks to bump the versioned number on the rule whenever this happens.
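
The effect of that change can be reproduced in isolation with the binary package alone. In this sketch, `BuildersOld` and `BuildersNew` are simplified stand-ins for the two versions of `Builders` (with `Maybe String` in place of `Maybe Elm` and `Maybe Haskell`, whose definitions are not shown here), and `decodeNew` is a hypothetical helper. Bytes written with the old shape are one field short when read back with the new one.

```haskell
{-# LANGUAGE DeriveGeneric #-}

import Data.Binary (Binary, decodeOrFail, encode)
import qualified Data.ByteString.Lazy as BL
import GHC.Generics (Generic)

-- Stand-in for Builders before the change (two fields).
data BuildersOld = BuildersOld
  { elmOld :: Maybe String
  , haskellOld :: Maybe String
  }
  deriving (Generic, Show, Eq)

instance Binary BuildersOld

-- Stand-in for Builders after the change (extra Bool field).
data BuildersNew = BuildersNew
  { elmNew :: Maybe String
  , haskellNew :: Maybe String
  , rubyNew :: Bool
  }
  deriving (Generic, Show, Eq)

instance Binary BuildersNew

-- Try to read bytes using the new shape, without throwing.
decodeNew :: BL.ByteString -> Either String BuildersNew
decodeNew bytes = case decodeOrFail bytes of
  Left (_, offset, msg) -> Left (msg ++ " (byte offset " ++ show offset ++ ")")
  Right (_, _, value) -> Right value

main :: IO ()
main = do
  let stored = encode (BuildersOld (Just "0.19.1") (Just "ghc-8.10"))
  -- Old bytes, new type: the decoder runs out of input at the Bool field.
  print (decodeNew stored)
  -- New bytes, new type: decodes fine.
  print (decodeNew (encode (BuildersNew (Just "0.19.1") (Just "ghc-8.10") True)))
```

When the value sits inside a larger structure such as a list, the decoder may instead consume bytes belonging to the next element, so the failure can show up at a later position or with a different message.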

saurabhnanda commented 3 years ago

I have been scratching my head over this issue for the last two days now:

Error when reading Shake database _build/.shake.database
  Reading from ByteString, insufficient left
  CallStack (from HasCallStack):
    error, called at src/General/Binary.hs:50:86 in shake-0.18.5-5KeTQg6YWjFdK6sR2J2BI:General.Binary
All files will be rebuilt

Here's what I am doing:

  1. Fire-up an ephemeral cloud server
  2. Run the shake build
  3. Save the cloud server as an image, and then, delete the ephemeral server.
  4. Create a new cloud server from the saved image
  5. Run shake on the newly created server
  6. Whenever shake/build is run for the first time on a newly created server, this error is reported.

I have a copy of the shake database that causes this error, if that helps.

saurabhnanda commented 3 years ago

I can reproduce this error predictably now:

  1. Run shake build on a cloud server
  2. Power off the cloud server (don't terminate)
  3. Power on the cloud server
  4. Run shake build again
  5. Error occurs

Is there any way to get more debugging information? Which field within the DB is corrupted, exactly? Is there anything being stored in the DB that is related to the underlying OS/machine (that could cause a corruption by simply stopping/starting the underlying machine)?

saurabhnanda commented 3 years ago

I tried running my Shakefile using the --trace option and it seems to be reading the shake DB in some sort of "chunks". Is there any way to print which chunk is causing this error?
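
To make the request concrete, here is a sketch of the kind of diagnostic I mean. It assumes a made-up layout where each chunk is a 32-bit big-endian length followed by that many bytes; this is NOT Shake's actual on-disk format, and `getChunk`, `putChunk` and `readChunks` are hypothetical names.

```haskell
import Data.Binary.Get (Get, getByteString, getWord32be, runGetOrFail)
import Data.Binary.Put (Put, putByteString, putWord32be, runPut)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL

-- One chunk in the assumed layout: length prefix, then payload.
getChunk :: Get B.ByteString
getChunk = do
  len <- getWord32be
  getByteString (fromIntegral len)

putChunk :: B.ByteString -> Put
putChunk payload = do
  putWord32be (fromIntegral (B.length payload))
  putByteString payload

-- Read chunks until the input is exhausted. On failure, report the index
-- of the bad chunk and the byte offset at which it starts.
readChunks :: BL.ByteString -> Either String [B.ByteString]
readChunks = go 0 0
  where
    go :: Int -> Integer -> BL.ByteString -> Either String [B.ByteString]
    go index start bytes
      | BL.null bytes = Right []
      | otherwise = case runGetOrFail getChunk bytes of
          Left (_, _, msg) ->
            Left ("chunk " ++ show index ++ " starting at byte "
                    ++ show start ++ ": " ++ msg)
          Right (rest, used, chunk) ->
            (chunk :) <$> go (index + 1) (start + fromIntegral used) rest

main :: IO ()
main = do
  let db = runPut (mapM_ putChunk [BC.pack "rule-one", BC.pack "rule-two"])
      -- Cut off the tail of the second chunk to simulate corruption.
      truncated = BL.take (BL.length db - 4) db
  print (readChunks db)
  print (readChunks truncated)
```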

saurabhnanda commented 3 years ago

Can reading env variables using unsafePerformIO have anything to do with this? For example:

envBranch :: String
envBranch = unsafePerformIO $ Env.getEnv "CI_COMMIT_REF_NAME"

Although I'm not sure whether this ends up in the shake DB or not.
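
For comparison, a variant that avoids `unsafePerformIO` would read the variable once in IO and pass it along explicitly. This is only a sketch using `System.Environment` from base; `BuildConfig` and `loadBuildConfig` are hypothetical names, and the `"unknown"` fallback is an assumption. Whether the original code has anything to do with the database error is not established.

```haskell
import Data.Maybe (fromMaybe)
import System.Environment (lookupEnv, setEnv)

-- Hypothetical configuration record holding values read from the environment.
newtype BuildConfig = BuildConfig {configBranch :: String}
  deriving (Eq, Show)

-- Read the branch name in IO; fall back to a placeholder when it is unset.
loadBuildConfig :: IO BuildConfig
loadBuildConfig = do
  branch <- lookupEnv "CI_COMMIT_REF_NAME"
  pure (BuildConfig (fromMaybe "unknown" branch))

main :: IO ()
main = do
  -- Set the variable here only so the example is self-contained.
  setEnv "CI_COMMIT_REF_NAME" "main"
  config <- loadBuildConfig
  print config
```

Inside Shake rules, `Development.Shake` also provides a `getEnv` action that returns a `Maybe String`, which may be a better fit than reading the variable outside the build.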

saurabhnanda commented 3 years ago

Update: It seems that the underlying disk of the VM was actually getting corrupted, because I was using an unsafe method to create a VM snapshot. After fixing the snapshotting process (i.e. shut down the VM before taking a snapshot) I don't think I have encountered this error with Shake.