junnlikestea / vita

A tool to find subdomains or domains from passive sources.
The Unlicense

Flush results to output as they are fetched #31

Closed dee-see closed 3 years ago

dee-see commented 3 years ago

Currently the tool fetches all subdomains and prints them all to stdout at the end. When running against targets with a very large number of subdomains, this allocates a very large vector of subdomains before calling cleaner.clean(subdomains). This is made worse by the fact that I'm running Vita on a 2 GB VPS, which simply can't handle it and crashes.

Running vita -d comcast.net results in "memory allocation of 1610612736 bytes failed". I don't know if there's something going wrong with the error message, but that's quite a bit of memory!

Flushing results to output as they are fetched would solve that; however, it might make it difficult to output only unique results. Personally I wouldn't mind a --flush switch that outputs duplicated results.

junnlikestea commented 3 years ago

> Running vita -d comcast.net results in "memory allocation of 1610612736 bytes failed". I don't know if there's something going wrong with the error message, but that's quite a bit of memory!

My guess is that, because comcast.net returns an absurd number of results, we really are running out of memory when trying to allocate space for them.
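As a rough sanity check on that guess, the sketch below does the arithmetic for one reading that happens to fit the number. It assumes a 64-bit target, a results vector of type Vec<String> (24 bytes per element on such a target), and a vector that grows by doubling from a small starting capacity. None of this is confirmed by the thread, and buffer_bytes and doubling_capacities are hypothetical helper names.

```rust
/// Bytes requested for a buffer holding `capacity` elements of
/// `elem_size` bytes each. Returns None on arithmetic overflow.
fn buffer_bytes(capacity: u64, elem_size: u64) -> Option<u64> {
    capacity.checked_mul(elem_size)
}

/// Capacities a vector passes through if it starts at `start` and doubles
/// until it can hold `needed` elements. Doubling is an assumed growth
/// policy used for illustration, not a guarantee of the standard library.
fn doubling_capacities(start: u64, needed: u64) -> Vec<u64> {
    assert!(start > 0, "a zero capacity can never double past `needed`");
    let mut caps = vec![start];
    let mut cap = start;
    while cap < needed {
        cap = cap.saturating_mul(2);
        caps.push(cap);
    }
    caps
}
```

Under those assumptions, buffer_bytes(1 << 26, 24) is exactly 1610612736 bytes (1.5 GiB), which is the request made when a vector of 2^25 strings doubles to 2^26. During that reallocation the old 768 MiB buffer may still be live as well, which would be too much for a 2 GB machine.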

> Flushing results to output as they are fetched would solve that; however, it might make it difficult to output only unique results. Personally I wouldn't mind a --flush switch that outputs duplicated results.

Yeah, I think this is a good idea. Maybe the solution is to make the Runner.run method return a stream, and have the PostProcessor.clean method just return an iterator over the filtered results.

Depending on the CLI flag, we then either remove the duplicates by collecting the iterator into a HashSet, or just write it to stdout with a BufWriter. What do you think?
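A minimal sketch of that idea, using only the standard library. The free functions clean and write_results are hypothetical stand-ins, not Vita's actual PostProcessor or Runner API, and the suffix-based filter is only an assumed example of what "cleaning" might do. The flush parameter models the proposed --flush switch.

```rust
use std::collections::HashSet;
use std::io::{self, BufWriter, Write};

/// Hypothetical stand-in for `PostProcessor.clean`: lazily normalises each
/// name and keeps only `root` itself or names ending in `.root`.
/// Nothing is collected here; items are produced one at a time.
fn clean<'a, I>(results: I, root: &'a str) -> impl Iterator<Item = String> + 'a
where
    I: Iterator<Item = String> + 'a,
{
    let suffix = format!(".{}", root);
    results
        .map(|name| name.trim().to_lowercase())
        .filter(move |name| name.as_str() == root || name.ends_with(suffix.as_str()))
}

/// Writes results to `out` through a BufWriter.
/// `flush == true`: stream every item as it arrives (duplicates allowed).
/// `flush == false`: collect into a HashSet first, then print unique names.
fn write_results<I, W>(results: I, out: W, flush: bool) -> io::Result<()>
where
    I: Iterator<Item = String>,
    W: Write,
{
    let mut out = BufWriter::new(out);
    if flush {
        // Streaming mode: memory use does not depend on the result count.
        for name in results {
            writeln!(out, "{}", name)?;
        }
    } else {
        // Unique mode: memory grows with the number of distinct names only.
        let unique: HashSet<String> = results.collect();
        for name in &unique {
            writeln!(out, "{}", name)?;
        }
    }
    out.flush()
}
```

With stdout as the sink, a call might look like write_results(clean(names, "example.com"), io::stdout().lock(), flush). Note that the unique mode still holds every distinct name in memory, so only the streaming mode keeps memory flat for very large targets.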

dee-see commented 3 years ago

I think that sounds great!

junnlikestea commented 3 years ago

Another thing I noticed while digging into this issue was that I was allocating another large Vec for the SonarSearch results coming over gRPC:

https://github.com/junnlikestea/vita/blob/3782231ede2da49e98cf915e6c638725c7cabb04/crobat/src/lib.rs#L48-L63

Because the type returned by the line below implements the Stream trait, we could probably just return that and avoid all those extra allocations:

https://github.com/junnlikestea/vita/blob/3782231ede2da49e98cf915e6c638725c7cabb04/crobat/src/lib.rs#L56

So the method would look something like:

    pub async fn get_subs(
        &mut self,
        host: Arc<String>,
    ) -> Result<impl Stream<Item = std::result::Result<Domain, Status>>> {
        trace!("querying crobat client for subdomains");
        let request = tonic::Request::new(QueryRequest {
            query: host.to_string(),
        });
        debug!("{:?}", &request);

        let stream = self.client.get_subdomains(request).await?.into_inner();
        Ok(stream)
    }