ncbi / ngs-tools

Relationship between the STAT analysis data available on the NCBI SRA Run Browser and that on the cloud platform #36

Open Junna-Kawasaki opened 6 months ago

Junna-Kawasaki commented 6 months ago

I am writing to seek your assistance with a question regarding the Cloud-based Taxonomy Analysis Information Table.

I noticed a discrepancy between the “identified_spot_count” available on the cloud platform and the "IDENTIFIED READS" displayed on the Sequence Read Archive Run Browser. For instance, in the case of ERR979125 on the Run Browser, 97.1% of the reads are listed as being of human origin (https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=ERR979125&display=analysis).

However, the data retrieved from the cloud shows the following:

Regardless of whether the denominator is the “analyzed_spot_count” or the “total_spot_count”, the percentages are significantly lower than those reported on the Run Browser (16.0% and 0.24%).

Could you kindly clarify the relationship between the data available on the Run Browser and that on the cloud platform?

I appreciate your assistance and look forward to your response.

multikengineer commented 6 months ago

Junna-Kawasaki,

Could you kindly clarify the relationship between the data available on the Run Browser and that on the cloud platform?

If you look at the row in the cloud table tax_analysis_info for ERR979125, you will find it was an aligned submission of human where only the unaligned spots were analyzed. In such a case Run Browser takes the total spot count (789675109), subtracts the analyzed spot count (1214952), and adds that difference (the aligned spot count) to the human spot count. Thus the denominator used in Run Browser is different: it is the identified_spot_count plus the additional human aligned spot count, and those aligned spots make up the majority of spots in this sample.
Does that explain it clearly?
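
For illustration, the adjustment described above can be sketched in Python. This is only one reading of the explanation, not the Run Browser's actual code: the function names and the `human_spots_stat` parameter (the human spot count reported by STAT for the analyzed spots) are hypothetical. Only the total and analyzed spot counts come from this thread.

```python
def aligned_spot_count(total_spot_count: int, analyzed_spot_count: int) -> int:
    """Spots that were aligned at submission and therefore skipped by STAT."""
    return total_spot_count - analyzed_spot_count


def run_browser_human_fraction(
    total_spot_count: int,
    analyzed_spot_count: int,
    identified_spot_count: int,
    human_spots_stat: int,  # hypothetical input: human spots found by STAT
) -> float:
    """Human fraction as described for aligned human submissions.

    Assumption: the aligned spots are added both to the human count
    (numerator) and to identified_spot_count (denominator).
    """
    aligned = aligned_spot_count(total_spot_count, analyzed_spot_count)
    numerator = human_spots_stat + aligned
    denominator = identified_spot_count + aligned
    return numerator / denominator


# Counts quoted in this thread for ERR979125.
aligned = aligned_spot_count(789675109, 1214952)
print(aligned)
```

Because the aligned spots dominate both the numerator and the denominator for this run, the resulting human percentage is far higher than any ratio computed from the cloud table columns alone.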

Junna-Kawasaki commented 6 months ago

Thank you for your prompt response.

I hope it's not too much trouble, but I have a few additional questions.

I would like to know the percentage of identified_spot_count (i.e., spots matching a certain organism) and unidentified_spot_count (i.e., spots not matching any organism) for each SRA dataset. However, the Cloud-based Taxonomy Analysis Information Table does not provide data in the unaligned_spot_count column for most SRA datasets.

Therefore, I considered calculating the percentage of identified_spot_count per sample and assuming that the remaining spots fall under unaligned_spot_count. Is it possible to estimate the unaligned_spot_count percentage in each SRA dataset using this approach?

In your previous explanation, it was stated that identified_spot_count does not contain spots aligned to the human genome, which leads me to believe that this method might not work well for some samples.

I apologize for the numerous questions, but I would greatly appreciate any advice you could provide on calculating the percentage of unidentified spots that did not match anything using STAT.

Thank you very much for your assistance.

multikengineer commented 5 months ago

@Junna-Kawasaki, apologies for this tardy reply.

Therefore, I considered calculating the percentage of identified_spot_count per sample and assuming that the remaining spots fall under unaligned_spot_count. Is it possible to estimate the unaligned_spot_count percentage in each SRA dataset using this approach?

identified_spot_count is the number of spots for which a taxon was deduced (assigned), and analyzed_spot_count is the total number of spots subjected to taxonomic analysis. It is therefore not unreasonable to consider that total_spot_count - analyzed_spot_count = unaligned_spot_count.

I would greatly appreciate any advice you could provide on calculating the percentage of unidentified spots that did not match anything using STAT.

That calculation is simply analyzed_spot_count - identified_spot_count. However, at the moment you will find that many STAT analysis results have a null identified_spot_count. While we hope to backfill those values in the somewhat near future, you can simply sum the total_count of two specific taxa: tax_id = 131567, name = 'cellular organisms' and tax_id = 10239, name = 'Viruses'. The sum of those will equal identified_spot_count. If you did this for any aligned submission, the spots that were aligned would not be considered.
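
A minimal Python sketch of this fallback is shown below. It assumes the per-taxon rows for a run have already been fetched as dictionaries with `tax_id` and `total_count` keys. The function names and row layout are hypothetical and do not correspond to any NCBI API. It also assumes the NCBI Taxonomy ID for 'cellular organisms' is 131567.

```python
from typing import Iterable, Mapping, Optional

# The two top-level taxa whose total_count values are summed.
CELLULAR_ORGANISMS = 131567  # assumed NCBI Taxonomy ID for 'cellular organisms'
VIRUSES = 10239
TOP_LEVEL_TAX_IDS = {CELLULAR_ORGANISMS, VIRUSES}


def identified_from_taxa(tax_rows: Iterable[Mapping[str, int]]) -> int:
    """Sum total_count over the two top-level taxa only.

    Lower-level taxa (e.g. Homo sapiens) are skipped because their spots
    are already included in the total_count of their top-level ancestor.
    """
    return sum(
        row["total_count"] for row in tax_rows if row["tax_id"] in TOP_LEVEL_TAX_IDS
    )


def unidentified_fraction(
    analyzed_spot_count: int,
    identified_spot_count: Optional[int],
    tax_rows: Iterable[Mapping[str, int]],
) -> float:
    """Fraction of analyzed spots that STAT did not assign to any taxon."""
    if analyzed_spot_count <= 0:
        raise ValueError("analyzed_spot_count must be positive")
    if identified_spot_count is None:  # null in the table: use the fallback
        identified_spot_count = identified_from_taxa(tax_rows)
    return (analyzed_spot_count - identified_spot_count) / analyzed_spot_count


# Made-up rows for illustration only; these are not real STAT results.
example_rows = [
    {"tax_id": 131567, "total_count": 150},
    {"tax_id": 10239, "total_count": 50},
    {"tax_id": 9606, "total_count": 120},  # nested under cellular organisms
]
print(unidentified_fraction(1000, None, example_rows))
```

As noted above, for an aligned submission this fraction covers only the spots that STAT analyzed. The aligned spots are not part of either count.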