opalj / opal

https://www.opal-project.de
Other
51 stars 27 forks source link

Roundtrip Bytecode Modification fails when using `ba.LabeledCode` #242

Open johannesduesing opened 8 hours ago

johannesduesing commented 8 hours ago

During the current iteration of our project course on static program analysis, one of our students discovered what i believe is a bug in abstract interpretation. The students were working on a task where they had to change bytecode of certain methods, and then write the modified bytecode to files. I know there is multiple ways to do this, but one way that seems particularly suited is the org.opalj.ba.LabeledCode builder. However, students discovered that the simple round-trip of reading code and writing it again without modifications using LabeledCode does not work for all methods of real-world projects. For example, it fails on 4% of all methods from Apaches PDFBox 3.0.3 (link to jar). Here is an example that exercises a specific method for which the roundtrip fails:

import org.opalj.ba.LabeledCode
import org.opalj.br.ObjectType
import org.opalj.br.analyses.Project

object PdfBoxDemo extends App {

  val pathToPdfBox = "pdfbox-app-3.0.3.jar"
  val project = Project(new java.io.File(pathToPdfBox))

  project.classFile(ObjectType("org/bouncycastle/x509/X509CertificatePair")) match {
    case Some(cf) =>
      val theMethod = cf.methods.find(_.name == "equals").get
      val builder = LabeledCode(theMethod.body.get)
      val newCode = builder.result(cf.version, theMethod)._1
      val newMethod = theMethod.copy(body = Some(newCode))
      println(s"Successfully re-serialized bytecode: ${newMethod.body.get}")
    case _ =>
  }
}

The exception that is thrown is very long and very detailed (pdfbox-error.txt), but boils down to the following:

@errt do you have any insights into this? Is the LabeledCode object intended to be used this way, did we do something not supported or is this actually a bug? If so, i'd look into it in the coming days.

Credit to Finn for reporting this issue 👍

errt commented 7 hours ago

Sadly I don't have much insight into the abstract interpretation code. Can you give me some hints where that ReferenceValue method is called from? If that method really leads to this, that seems strange. In that case, this would also produce an InitializedObjectValue, right? And that would contradict the method's name.

Difficult to say where the underlying problem is. Could be a mixin issue with the two domains, but could also be wrong implementation in either. Especially org.opalj.ai.domain.l0.TypeCheckingDomain looks strange: The ArrayValue method returns a (potentially-null) DefaultArrayValue, but the related ObjectValue methods returns a (non-null) InitializedObjectValue. The other ObjectValue method returns a (potentially-null) DefaultMObjectValue only if the upperTypeBound has multiple types, but a (non-null) InitializedObjectValue (via the ObjectValue method), which seems strange since I don't think a singleton upperTypeBound would exclude null either.

johannesduesing commented 6 hours ago

Sure i can try to give you some context on this:

I again confirmed via debugging that this in fact calls TypeCheckingDomain.ObjectValue and leads to an initialized object value, thus no field can ever be nullable (in this mix-in configuration for TypeCheckingDomain). So to me this seems like an implementation issue in TypeCheckingDomain, right? Unfortunately i don't have any idea how fields are usually handled in AI, but this seems factually wrong. This is an interesting quote from the code-comments in TypedValueFactory.TypedValue:

/**
     * Factory method to create domain values with a specific type. I.e., values for
     * which we have some type information but no precise value or source information.
     * However, the value is guaranteed to be `null` or properly initialized.
     *
     * For example, if `valueType` is a reference type it may be possible
     * that the actual value is `null`, but such knowledge is not available.
     *
     * The framework uses this method when a method is to be analyzed, but no parameter
     * values are given and initial values need to be generated. This method is not
     * used elsewhere by the framework.
     */

This comment seems to specifically acknowledge that value can be null here, but the actual implementation in TypecheckingDomain contradicts this.

errt commented 5 hours ago

Note that the code we're discussing about is from this commit and unchanged since then. Interestingly, here, code in TypeLevelReferenceValues that used to produce an InitializedObjectValue was changed to ObjectValue which then, via TypeCheckingDomain ends up being an InitializedObjectValue again. But, at the time, that did not have isNull = No (at least not explicitly). That only got introduced in this commit, which, interestingly, is about fields. Might still be wrong or point to wrong mixing in of some field related domain to handle this first/differently for fields?