sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/
Other
476 stars 79 forks source link

how do we get `scaled` from a `RevIndex`? #3386

Closed ctb closed 5 days ago

ctb commented 1 week ago

from slack -

me:

is there a standard or correct way to get the scaled value from a rocksdb revindex? I know they’re made from a CollectionSet so they have to have a single scaled, I believe. But I don’t see any standard way to get at it. (running into some …challenges with scaled and manysearch against a rocksdb)

luizirber:

Hmm, I guess you can grab the manifest and read a record from it, but that doesn’t mean that’s the scaled used for the DB, because it can be downscaled So probably good to save it in the METADATA column, like the manifest is

me:

on reflection, I think we should require that a single scaled be used for the database, in which case either approach would work. Would prefer to avoid changing the METADATA, yah? happy to be told otherwise. just a hot take.

here’s what I tried out - seems to resolve issues: https://github.com/sourmash-bio/sourmash_plugin_branchwater/pull/504/files#diff-c9cd745a5c795bf1a3ca0d53653a6c2381d33aa49a04218343373e93fc82d830 what do you think about (1) fixing scaled for a database when constructing, and (2) provide a scaled() on the RevIndex?

code in manysearch_rocksdb.rs over in sourmash_plugin_branchwater:

  // grab scaled from the database.
    let max_db_scaled = db
        .collection()
        .manifest()
        .iter()
        .map(|r| r.scaled())
        .max()
        .expect("no records in db?!");

    let selection_scaled: u32 = match selection.scaled() {
        Some(scaled) => {
            if *max_db_scaled > scaled {
                return Err("Error: database scaled is higher than requested scaled".into());
            }
            scaled
        }
        None => {
            eprintln!("Setting scaled={} from the database", *max_db_scaled);
            *max_db_scaled
        }
    };
ctb commented 1 week ago

what do you think about (1) fixing scaled for a database when constructing, and (2) provide a scaled() on the RevIndex?

luiz:

Yes for both, and I’m OK stuffing more into METADATA because this is at the db level, not sig

ctb commented 1 week ago

feels like this gels a bit with #3387, which is updating Manifest records with scaled from select.

I think the idea would be to make sure that RocksDB-revindex are built only from CollectionSet (which I think is already the case) and then upgrade CollectionSet to make sure there is only one scaled (it already checks ksize, moltype, etc.)

Looking at the code, we might want to introduce an explicit select in the try_from, since with the updates from #3387 that would update the scaled.

impl TryFrom<Collection> for CollectionSet {
    type Error = crate::Error;

    fn try_from(collection: Collection) -> Result<Self> {
        let first = if let Some(first) = collection.manifest.first() {
            first
        } else {
            // empty collection is consistent ¯\_(ツ)_/¯                        
            return Ok(Self { collection });
        };

        collection
            .manifest
            .iter()
            .skip(1)
            .try_for_each(|c| first.check_compatible(c))?;

        Ok(Self { collection })
    }
}

impl CollectionSet {
    pub fn into_inner(self) -> Collection {
        self.collection
    }

    pub fn selection(&self) -> Selection {
        todo!("Extract selection from first sig")
    }
...
}
ctb commented 1 week ago

ok, I'm hesitating to run a select as that will change the function signature - we'd either need to do a clone or change the collection.

ctb commented 1 week ago

In #3397, I add min_max_scaled() and enforce that they are the same when converting a Collection into a CollectionSet.