ndreey / ghost-magnet

Molecular Bioinformatics BSc thesis project at University of Skövde
MIT License
1 stars 0 forks source link

GHOST-MAGNET

Excuse the mess

Untill the completion of the project the repository will be messy with undocumented files and folders. A thorough clean up of files and structures will be commenced towards the end of the project. However, the Issues page can be seen as a journal, to learn more about the thought process for each decision.

workflow_final2

Evaluation of Host Contamination Removal Post Sequencing

This repository contains the code and data for my BSc thesis project in Molecular Bioinformatics at the University of Skövde. The project evaluated the removal of host contamination post sequencing of non-model organisms.

Project Aim

The aim of this project was two-fold:

  1. To determine how different levels of host contamination affects assembly and genome binning, and whether a significant correlation exists.
  2. To evaluate if removal of reads mapping to host bins is a valid method for host contamination depletion post sequencing.

image

Study Design

This project is an exploratory study that aims to determine if the method of host contamination removal has validity and should be explored further to generate a package/program that classifies and removes host bins, and re-runs metagenomic analysis.

Repository Contents

This repository contains the following files:

Subdirectories

Within the code/ directory, the following subdirectories serve specific functions:

Within the data/ directory, the following subdirectories are further categorized:

Additionally, the exploratory/ subdirectory is designated for storing files, notes, papers, and other various files that can be of interest to the project. In contrast, files determined relevant to the final report are placed in the submission/ subdirectory.

Using GitHub for Scientific Data Analysis

The development of software is a complex process that requires managing various versions of code, tracking progress, and ensuring that the final product works as intended. These criteria are similarly applicable and helpful for scientific data analysis, where reproducibility is of great importance. GitHub is an online platform that allows developers to create repositories for storing and managing code changes.

The platform enables changes made to the code to be tracked, with the possibility to revert to previous versions if necessary. It also provides the potential for managing different versions of the code for different environments. With the intention of ensuring reproducibility and documentation, the main idea was to create a GitHub repository for each step of the data analysis. For the mock data simulation, “01_clean_mock_data” was created.

With the GitHub repository, the project will also align with the FAIR principles described by (Wilkinson et al., 2016), which aim to make data and code findable, accessible, interoperable, and reusable. Further, when working with high-performance computing (HPC), it is essential to carefully optimize and test the programs to ensure a clean run on the server. By utilizing the local Git repository, thorough tests of scripts, programs, and program configurations can be determined and adjusted for using a subset of the data before the highly resourceful run on the server. The use of a local Git repository also facilitates the uploading of the correct file structure for the proper execution of programs without significant modifications to paths.

Finally, with the use of the issue feature on GitHub, one can track, report and resolve problems or tasks related to the project. The feature is helpful for collaborative development, where multiple contributors are involved in different aspects. Nonetheless